Learning Objectives
Following this assignment students should be able to:
- understand how to find data to supplement existing datasets
- understand API interfaces in R
- pull supplemental data into datasets in R
- merge supplemental data into datasets
Reading
- 
    Topics - Finding ancillary & supplemental data
- Reference datasets for taxonomic checking
- API packages in R
- Merging with existing data
        - join types
- roll joins
- interval overlap joins
 
 
- 
    ReadingsBrowsings
Lecture Notes
Schema assignment review & discussion
APIs & Data Integration
Introduction
API R packages
Joining data
Exercises
- -- Supplemental Data Discussion --- Pairwise / Group Discussion:- 
    Think about the research datasets you have (or will have). What else would you like to know about your species, your locations, your dates, your observations? Write down a few examples. 
- 
    Where you would obtain the data you would like to have? 
- 
    How would you join the data you obtain, to the data you already have? Try to be as specific as you can, thinking in terms of actual columns. 
 
- 
    
- -- Worldclim data --- The worldclim dataset can be useful in distinguishing sites in the recent historic period - the “stationary” means across the thirty years from 1970 - 2000. - Install necessary packages - # raster package is an R interface to some GIS functionality install.packages('raster') library(raster)- Get worldclim data - ## there are multiple options for worldclim - see ?getData # for bioclim at highest resolutions, a lat/long is specified bioclim <- getData("worldclim", var='bio', res=0.5, lon=-93.185, lat=44.985) bioclim- The data downloaded is a set of raster tiles. There is one tile for each of the 19 bioclim variables (see http://worldclim.org/bioclim) - When multiple raster tiles are combined in the same geographic location, it can be organized into a raster stack. - You can plot the raster values: - bioclim[[1]] # the mean annual temperature raster tile plot(bioclim[[1]]) # add reference lat-long pair points(x = -93.185, y = 44.985, cex = 2, pch = 16) plot(bioclim[[12]]) # mean annual precipitation in mm # add reference lat-long pair points(x = -93.185, y = 44.985, cex = 2, pch = 16)- But what you are mostly interested in, are the values themselves at your lat-long of interest. The - extractfunction in the raster package does that.- # make a lat-long data frame loc <- data.frame(site = 'UMN-StP', longitude = -93.185, latitude = 44.985) # overlay the lat-long data frame on the entire stack extract(bioclim, loc)- Challenge problem:- 
    Connect to the Bee Lab database (your local copy, or the remote hosted one) 
- 
    Create an R table that has the six sites with latitude and longitude: # where 'con' is your connection definition: loc2 <- dbGetQuery(con, "select siteID, longitude, latitude from sites;") # because of db field definition (VARCHAR) must turn lat/lon to numeric loc2$latitude <- as.numeric(loc2$latitude) loc2$longitude <- as.numeric(loc2$longitude)
- 
    Plot the raster tile for Mean Temperature of the Coldest Quarter (refer to bioclim variable list linked above). Add the sites to the map. 
- 
    Which site has the lowest mean temperature of the coldest quarter? 
 
- 
    
- -- Get weather --- Weather is one of the most frequently desired covariates for any environmental analysis. The National Oceanic and Atmospheric Administration has a history of high quality, high frequency, high resolution datasets that are available to the public. - The National Climatic Data Center provides an API connection to historical weather. - The rOpenSci team has created an R package that queries this API directly, called - rnoaa.- Get a key from NOAA NCDC - this is necessary to use the API package.
    - go to https://www.ncdc.noaa.gov/cdo-web/token
- enter your email
- your “token” or key will be sent to you.
 
- Install and load the package, and paste your token into an object
 - install.packages('rnoaa') library(rnoaa) ?rnoaa mytoken <- 'QrKTCPmNwJWXxFhpLHijBfXPIRShBAiv' # this is EL's token- This part takes a while, so the identifier for the weather station is provided for you. But if you want to explore available stations, you can download the whole set (~600K stations, takes a few minutes). - #library(data.table) #station_data <- ghcnd_stations() #setDT(station_data) #station_data[latitude > 44.9 & latitude < 45 & longitude > -93.2 & longitude < -93, .N, name] #station_data[name == 'UNIV OF MINN ST PAUL']- Once the weather station is identified, you can pull observations: - # the weather station USC00218450 is on the St Paul Campus out <- ncdc(datasetid='GHCND', stationid='GHCND:USC00218450', datatypeid='TMAX', startdate = '2017-03-01', enddate = '2017-03-31', limit=500, token = mytoken) out # examine results ncdc_plot(out) # built-in plot function (limited but quick)- Challenge problem:- 
    Connect to the Bee Lab database (your local copy, or the remote hosted one) 
- 
    Get the “collection_events” table from the database into an R data frame. 
- 
    Using the St. Paul campus weather station (ID above), download the precipitation data for the dates inclusive of all the collection events. HINT: use datatypeid = 'PRCP'
- 
    What was the rainiest date during the study period? How much rain fell (in mm)? 
 
- Get a key from NOAA NCDC - this is necessary to use the API package.
    
- -- Check taxonomy --- The Plant List has a web API with an accompanying R package: - Taxonstand. We will use the API to check a list of plant names for accuracy and synonomy.- First, get necessary packages: - install.packages('Taxonstand') library(Taxonstand) library(tidyverse) library(RMariaDB)- We will use the - specimenstable from the Bee Lab dataset:- # make your connection to local or remote db as 'con' specimens <- dbReadTable(con, "specimens") ### Small Table so we can disconnect ## dbDisconnect(con)- Creating a “genus-species” column helps with merging - specimens$gs <- paste(specimens$plant.genus, specimens$plant.species)- Now we get unique values for plant names in this dataset. What are some potential problems that you see here? - plants<-unique(specimens[c("plant.genus", "plant.species")]) plants- Now we can get data from The Plant List. The ‘TPL’ function does this. Note that it can take awhile with a lot of names… - sp.out <- as.data.frame(TPL(genus=plants$plant.genus, species=plants$plant.species)) sp.out- first let’s find things that do not match - out1<-filter(sp.out, Plant.Name.Index==FALSE) %>% select(Taxon, New.Genus, New.Species) out1- We can update plant names. - out1$New.Genus <- ifelse(out1$New.Genus == "common", "Helianthus","null") out1$New.Species <- ifelse(out1$New.Species == "sunflower", "annuus","null")- Now we have a data frame with Taxon (old name) and upated new names We want to keep this old name for joining with our data set at the end. - We can also look for typos - out2<-filter(sp.out, Typo==TRUE)%>% select(Taxon, New.Genus, New.Species) out2- Challenge problem:- 
    Find any synonyms returned from the Plant List for the specimen names. 
- 
    Join the original specimens table to the Plant List taxonomy. 
- 
    How many bees were observed on Compositae flowers? 
 
- 
    
