Intro to data.table

Introduction to `data.table`

The data.table package for R is the brainchild of two heavy data users in R, one (Matt Dowle) from the financial industry, and one (Arun Srinivasan) from genomics. Both using large amounts of data, and finding ways to code things to go faster and more efficiently.

setting as data.table

A data.table is like data.frame plus. In fact it is also a data.frame, meaning whatever code you use on a data.frame (including dplyr or other tidyverse code) works fine on data.table objects.

Pull in data

bee_traits <- dbReadTable(con, "bee_traits")

library(data.table)

dat <- setDT(bee_traits)
class(dat)
dat

I like how data.table has nicely formatted print options for data.

i, j, by

The overarching concept of data.table is “i, j, by.”

i = rows to use; subsetting
j	= columns to use; selecting
by = columns to aggregate with

These concepts are applied within the square bracket [ ] select domain as with base R.

Column selections are sort of like base, without the fussiness of the quoting:

Select (j):
- dat[, .(Family, genus, species, Parasitic)]
Filter (i):
- dat[Family == 'Andrenidae'] # note no comma
- dat[n_Lat > 0]
- dat[n_Lat > 0 & Family == 'Andrenidae']
Mutate (set j):
- dat[, FamCode := substr(Family, 1, 4)]

In this case, in the place of selecting a column or set of columns for j, we are assigning a value to a new column. This operation still takes place in the j (or “SELECT”) part of the bracket subset. However it includes a new symbol: :=. This symbol means “Set equal to.”

One notable difference with data.table is that column definitions or modifications and calculations are made using assignment by reference. In the internals of R, when a new column is made (for instance with mutate from dplyr), the entire table is copied to add it. This can be expensive from a memory (and thus speed) perspective when R is working with larger files. Instead, the data.table approach is to make a reference which is similar to an index, which tells R how to calculate the field when it is needed, rather than make a copy of the table to have it immediately available.

Do Exercise 2 - Bee Trait Data Basics, but with data.table syntax.

Aggregation

This is an area of excellence for data.table.

Group by:

By placing a second comma, the “by” part of i, j, by is introduced.

What follows the second comma is used to aggregate.

Famdf <- dat[, .N, Family]

Here is another super-useful shortcut: .N. This simply means “count”. But when in combination with group by variables, it becomes a work horse of tabling.

The summary variables in aggregation can be renamed:

Grouping:
- dat[, .(num_species = .N), Family]

Do Exercise 3 - Bee Trait Aggregate, but with data.table syntax.

Joins

In data.table, the i in i, j, by can be a filtering statement, or it can be another data.table that is indexed to match the target data. This works like merge or joining:

specimens <- dbReadTable(con, "specimens")
specDT <- setDT(specimens)
specimens[dat, on = .(genus, species), nomatch = 0L]

The nomatch function tells data.table what to do if there is no match for the on columns. The default is to keep the rows, but set the corresponding columns to NA. The nomatch = 0L option results in non-matched rows being dropped.

Do Exercise 4 - Bee Table Merge, but with data.table syntax.

Combining operations

Combine a series of data manipulation actions
Intermediate variables
- Step-wise approach
- Can get cumbersome with lots of variable objects in the environment

parasiticBees <- filter(df, Parasitic == 'Yes')
parasiticBees_byFam <- group_by(parasiticBees, Family)
parasitic_avg_emerge_byFam <- summarize(parasiticBees_byFam, 
                                 avg_emerge = mean(Pheno_mean, na.rm=TRUE))
parasitic_avg_emerge_byFam

Thanks to i, j, by data.table can do this in one step: filter (i), assign & select (j), aggregate by (by)

This is why data.table appeals to those used to thinking about SQL and relational databases, because it mimics the efficiency of SQL syntax.

dat[Parasitic == 'Yes', .(avg_emerge = mean(Pheno_mean, na.rm = TRUE)), Family]

Do Exercise 5 - Bee Piping, but with data.table syntax.

Data Management for Biologists

Notes

Introduction to data.table