Data Manipulation in R
We will use the connection to the BeeLab dataset hosted on your local machine for these exercises:
- Start MAMP - ensure it is running
- Open the R Studio project file you have associated with your local repo of ENT5920_DatabaseTutorial
- Before beginning,
pull
to get the latest version of any shared code.
Basic dplyr
- Modern data manipulation library for R
- Does a lot of the same things we’ve learned to do in SQL.
Pull in data
bee_traits <- dbReadTable(con, "bee_traits")
library(dplyr)
library(data.table)
dat <- setDT(bee_traits)
df <- data.frame(bee_traits)
class(dat); class(df)
- Select:
select(df, Family, genus, species, Parasitic)
- Filter:
filter(df, Family == 'Andrenidae')
filter(df, n_Lat > 0)
filter(df, n_Lat > 0, Family == 'Andrenidae')
- Mutate:
mutate(df, FamCode2 = substr(Family, 1, 4))
Aggregation
- Group by:
group_by(df, Family)
- Different looking kind of
data.frame
- Source, grouping, and data type information
Famdf <- group_by(df, Family)
- Grouping with
summarize()
:summarize(Famdf, num_species = length(species))
Joins
inner_join
indplyr
works likeJOIN
in SQL
specimens <- dbReadTable(con, "specimens")
combined <- inner_join(specimens, df, by = c('genus', 'species'))
head(combined)
Pipes
- Combine a series of data manipulation actions
- Intermediate variables
- Step-wise approach
- Can get cumbersome with lots of variable objects in the environment
parasiticBees <- filter(df, Parasitic == 'Yes')
parasiticBees_byFam <- group_by(parasiticBees, Family)
parasitic_avg_emerge_byFam <- summarize(parasiticBees_byFam,
avg_emerge = mean(Pheno_mean, na.rm=TRUE))
parasitic_avg_emerge_byFam
- Pipes:
- Operator:
%>%
- Operation:
%>%
takes the output of one command and passes it as input to the next commandx %>% f(y)
translates tof(x, y)
surveys %>% filter(species_id == "DS")
- Operator:
df %>% # Command-Shift-M is the shortcut to insert a pipe %>%
filter(Parasitic == 'Yes') %>%
group_by(Family) %>%
summarize(avg_emerge = mean(Pheno_mean, na.rm = TRUE))