R tip #2: Learn dataframe subsetting in R in 5 minutes

To subset a dataframe: to apply conditions to extract a subset of the data that is in the dataframe.

Example: a company has employees listed in a dataframe and the employees work in either Texas or California. You may want to subset the dataframe to get only employees in Texas.

There are several ways to subset dataframes in R. There is a well established and very flexible system in base R. The “tidyverse” (specifically dyplyr) offers another approach, which is also quite flexible.
Although the tidyverse approach continues to grow in popularity, most online examples of R code use traditional subsetting.
Traditional subsetting of dataframes in R can seem bizarre. Every year, I find that students struggle with this.

In this post, I’ll talk about the logic of subsetting in a few steps.

LOAD DATA

data(mtcars)

IMPORTANT: for mtcars, the car name is actually the row name. So the car name is not a column! (But you can always get car names by executing rownames(mtcars))

Let’s look at the rownames and the columnnames

rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

0. Remember: R numbering starts with 1, not zero!

1. Fundamental syntax of subsetting

We subset mtcars by setting a row condition and a column condition.

mtcars[row_condition,column_condition]

row_condition: an expression telling us how to filter the rows. For example, we may want rows 5,8, and 12.
column_condition: an expression telling us how to filter the columns. For example, we may want only the columns 1 and 4.

2. Simple subsetting by number

I want data from row 7.
The row_condition is 7.
The column_condition is everything, which is represented just with no character or a space.

#this option - preferred formatting with space
mtcars[7, ]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
#this gives same output (no space used in statement)
mtcars[7,]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4

Second example:
To get just column 7,

mtcars[ ,7]
##  [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90
## [12] 17.40 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87
## [23] 17.30 15.41 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60

3. More advanced subsetting by multiple numbers

I want data from rows 7 and 12.
I want columns 3 and 8.
To get rows 7 and 12, we set the row_condition to the vector c(7,12).
To get columns 3 and 8, we set the column_condition to the vector c(3,8).

mtcars[c(7,12),c(3,8)]
##             disp vs
## Duster 360 360.0  0
## Merc 450SE 275.8  0

4. Simple true and false subsetting

If the value is TRUE we keep the row.
Let’s make a simple subset of the mtcars dataframe to demonstrate this just using our numeric subsetting; call it mtsmall.

#make a small dataframe with first 3 rows and first three columns; this does not use true/false approach
mtsmall <- mtcars[c(1,2,3),c(1,2,3,4)]
mtsmall
##                mpg cyl disp  hp
## Mazda RX4     21.0   6  160 110
## Mazda RX4 Wag 21.0   6  160 110
## Datsun 710    22.8   4  108  93

What if we want just rows 1 and 3?
We can use c(TRUE, FALSE, TRUE) for the row_condition.

#only rows 1 and 3 and ALL columns (from reduced dataframe)
mtsmall[c(TRUE,FALSE, TRUE), ]
##             mpg cyl disp  hp
## Mazda RX4  21.0   6  160 110
## Datsun 710 22.8   4  108  93

5. Conditional subsetting

Let’s go back to mtcars, the full dataframe.
What if we want to get all information for cars with mpg>20?
row_condition seems like it would be “mpg>20”. But this won’t work – we need to supply a true or false for each row.
column_condition is everything. Everything leads to a space or nothing.

BUT: the row condition will seem strange.
Note that “mtcars$mpg>20” leads to a series of true and false. For every true, we keep that row. For a false, we eliminate the row.

mtcars$mpg>20
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

row_condition is “mtcars$mpg>20”
column_condition is everything. Everything is indicated by a space or nothing.

#all columns for cars with mpg>20
mtcars[mtcars$mpg>20, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

6. Excluding rows or columns

For dataframes, -n means exclude row (or column) n.
This is quite useful when there are some observations that are faulty.

mtcars[-5, ] return the dataframe with all rows except row 5 and all columns

#all data except row 5
mtcars[-5, ]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The same sort of logic holds for excluding columns.

#all columns except 5; all rows
mtcars[ ,-5]
##                      mpg cyl  disp  hp    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 2.780 18.60  1  1    4    2
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s