data.frames
Basics
Data.frames are one of the most frequently used data structures in R. Data.frames organize data into a 2D table consisting of rows & columns, where each column represents a variable and each row contains one value for each column.
Bracket Subsetting/Indexing
Creating a data.frame is easily done by filling in the columns using vectors, which are declared using c()
as follows.
Data Frame Creation
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
ok=c(T, T, F), other=c("first", "second", "third"))
head(df)
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second 3 3 7 FALSE third
The parameter names in the data.frame()
function become the columns of the data.frame, while the number of rows are determined by the size of the vectors.
The different columns of a data frame can contain different types of values, but the variables within the column must have the same type. In this case, cat_1
and cat_2
contain integers, ok
contains booleans, and other
contains Strings.
Indexing Rows Numerically
Regular indexing rules apply to R data frames. Pay close attention to the commas in the following examples:
df[1:2, ]
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second
This method uses the indices of the rows, which are independent of the row names. We can update the names of the rows and subsequently index those as well, if row names are appropriate for the situation.
Row Naming & Indexing on Row Names
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"), ]
cat_1 cat_2 ok other row1 1 9 TRUE first row3 3 7 FALSE third
Though the row names replace the numerical indices in the output, we can still index using either. This same logic applies to columns, which also have intrinsic indices and are required to be named in order to be created. |
So far we’ve indexed in two ways, and their differences merit explanation:
-
:
selects indices based on the given sequence. In R, this process is inclusive, meaning that1:4
will select the first, second, third, and fourth entries. -
c()
defines a vector, as explained in the Lists & Vectors page, and indexing on vectors will select all rows/columns shared between the vector and data frame.
Logical Indexing
Indexing can also be done logically using a vector of Boolean values:
# selection is True for the first line,
# False for the second, and True for the third
df[c(T,F,T),]
cat_1 cat_2 ok other 1 1 9 TRUE first 3 3 7 FALSE third
For all of the above examples, there was at least one comma — anything before the comma defines row selection, and anything after the comma defines column selection. If you leave out the comma, R will default to column selection.
Column-Default Indexing
df[c("cat_1", "ok")]
cat_1 ok row1 1 TRUE row2 2 TRUE row3 3 FALSE
This is equivalent to leaving a blank space before the comma:
Indexing Column-Specific
df[, c(1,3)]
cat_1 ok row1 1 TRUE row2 2 TRUE row3 3 FALSE
We can apply sequence-indexing and logical indexing to columns in the same way. You’ll find that indexing rows and indexing columns is a nearly identical process that is easy to get hold of. We can combine any of the previous methods to index rows and columns simultaneously.
Here are a few examples that combine everything we discussed in the previous section:
df[1:2, c(1,3)]
cat_1 ok row1 1 TRUE row2 2 TRUE
df[c(T,F,T), c(T, F, F, F)]
[1] 1 3
$ Subsetting/Indexing
A key feature of R is the $
operator on data.frames, which is the more common indexing method for R if only one column is needed.
$ Column Indexing
df$cat_1
[1] 1 2 3
You can extend this to index for row as well using It’s good to keep in mind that |
Additionally, you can select values from a column with a vector of boolean values:
df$cat_1[c(F,T,F)]
[1] 2
Examples
How can I get the first 2 rows of a data.frame named df
?
Click to see solution
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second
How can I get the first 2 columns of a data.frame named df
?
Click to see solution
df[,1:2]
cat_1 cat_2 1 1 9 2 2 8 3 3 7
How can I get the rows where values in the column named cat_1
are greater than 2?
Click to see solution
# first example, using $
df[df$cat_1 > 2,]
cat_1 cat_2 ok other 3 3 7 FALSE third
# second example, using []
df[df[, c("cat_1")] > 2,]
cat_1 cat_2 ok other 3 3 7 FALSE third
How can I get the rows where values in the column named cat_1
are greater than 2 and the values in the column named cat_2
are less than 9?
Click to see solution
df[df$cat_1 > 2 & df$cat_2 < 9,]
cat_1 cat_2 ok other 3 3 7 FALSE third
How can I get the rows where values in the column named cat_1
are greater than 2 or the values in the column named cat`_2 are less than 9?
Click to see solution
df[df$cat_1 > 2 | df$cat_2 < 9,]
cat_1 cat_2 ok other 2 2 8 TRUE second 3 3 7 FALSE third