Data Manipulation

Learning Objectives

Following this assignment students should be able to:

understand the data manipulation functions of dplyr

understand the

Reading

Topics
- dplyr
- data.table
Readings

Lecture Notes

Assignment Introduction: Creating a schema

Wrapping up SQL

Data manipulation in R

dplyr
- filter
- select
- arrange
- mutate
- summarise
- pipes & syntax
data.table

Exercises

-- dplyr --

Install and familiarize yourself with the dplyr package. The library() step(s) should always be located at the very top of a script.
```
install.packages("dplyr")

library(dplyr)

help(package = dplyr)
```
This vignette is a great reference for data manipulation verbs to keep in mind.
-- Bee Trait Data Basics --

Connect to the BeeLab dataset on your local MySQL instance. In R:
1. Create a connection to the database using RMariaDB
2. Import the bee_traits table into a named data frame.
3. Check the column names in the data using the function names().
4. Use str() to show the structure of the data frame and its individual columns.
5. Print out the first few rows of the data using the function head().
  
  Use dplyr to complete the remaining tasks.
6. Select the first few rows of data from the Taxon_Author column and print it out.
7. Select the data from the Family, Subfamily, and Tribe columns and print it out.
8. Filter the data for all of the bees that are Parasitic and print out their genus and species.
9. Create a new data frame called endangered_bees which have a value of “endangered” in the USFWS_status column. Print it out.
[click here for output]
-- Bee Trait Aggregate --

This is a follow-up to Bee Trait Data Basics.

For data.frame df this dplyr code calculates the average day of emergence by Family:
```
Famdf <- group_by(df[!is.na(Pheno_mean)], Family)
summarize(Famdf, emerge_Julian = mean(Pheno_mean))
```
1. Modify the code to calculate and print the average emergence of a bee species in each Subfamily.
2. Use max() to determine the latest bee emergence in each Family.
[click here for output]
-- Bee Table Merge --

This is a follow-up to Bee Trait Aggregation.

From the BeeLab database, import the mn_bees table and then use inner_join to combine it with the bee_traits table to add a Bee_Family column to the mn_bees data.
-- Bee Piping --

This is a follow-up to Bee Trait Aggregation and Bee Table Merge.
1. Import the sites table from the BeeLab database into a data.frame. You should already have bee_traits and specimens in your R environment.
  
  Use dplyr to perform the following
2. Calculate the site that has the highest bee species richness.
3. Calculate the unique number of Parasitic bees observed in each soil type.
4. For all parasitic bees in bee_traits, make a table showing the number of species in each genus by Lecticity (the columns will be something like genus, Lecticity, count_sp).

Data Management for Biologists

Assignment

Learning Objectives

Reading

Lecture Notes

Wrapping up SQL

Data manipulation in R

Exercises

-- dplyr --

-- Bee Trait Data Basics --

-- Bee Trait Aggregate --

-- Bee Table Merge --

-- Bee Piping --