Learning Objectives
Following this assignment students should be able to:
- understand how to find data to supplement existing datasets
- understand API interfaces in R
- pull supplemental data into datasets in R
- merge supplemental data into datasets
Reading
-
Topics
- Finding ancillary & supplemental data
- Reference datasets for taxonomic checking
- API packages in R
- Merging with existing data
- join types
- roll joins
- interval overlap joins
-
ReadingsBrowsings
Lecture Notes
Schema assignment review & discussion
APIs & Data Integration
Introduction
API R packages
Joining data
Exercises
-- Supplemental Data Discussion --
Pairwise / Group Discussion:
-
Think about the research datasets you have (or will have). What else would you like to know about your species, your locations, your dates, your observations? Write down a few examples.
-
Where you would obtain the data you would like to have?
-
How would you join the data you obtain, to the data you already have? Try to be as specific as you can, thinking in terms of actual columns.
-
-- Worldclim data --
The worldclim dataset can be useful in distinguishing sites in the recent historic period - the “stationary” means across the thirty years from 1970 - 2000.
Install necessary packages
# raster package is an R interface to some GIS functionality install.packages('raster') library(raster)
Get worldclim data
## there are multiple options for worldclim - see ?getData # for bioclim at highest resolutions, a lat/long is specified bioclim <- getData("worldclim", var='bio', res=0.5, lon=-93.185, lat=44.985) bioclim
The data downloaded is a set of raster tiles. There is one tile for each of the 19 bioclim variables (see http://worldclim.org/bioclim)
When multiple raster tiles are combined in the same geographic location, it can be organized into a raster stack.
You can plot the raster values:
bioclim[[1]] # the mean annual temperature raster tile plot(bioclim[[1]]) # add reference lat-long pair points(x = -93.185, y = 44.985, cex = 2, pch = 16) plot(bioclim[[12]]) # mean annual precipitation in mm # add reference lat-long pair points(x = -93.185, y = 44.985, cex = 2, pch = 16)
But what you are mostly interested in, are the values themselves at your lat-long of interest. The
extract
function in the raster package does that.# make a lat-long data frame loc <- data.frame(site = 'UMN-StP', longitude = -93.185, latitude = 44.985) # overlay the lat-long data frame on the entire stack extract(bioclim, loc)
Challenge problem:
-
Connect to the Bee Lab database (your local copy, or the remote hosted one)
-
Create an R table that has the six sites with latitude and longitude:
# where 'con' is your connection definition: loc2 <- dbGetQuery(con, "select siteID, longitude, latitude from sites;") # because of db field definition (VARCHAR) must turn lat/lon to numeric loc2$latitude <- as.numeric(loc2$latitude) loc2$longitude <- as.numeric(loc2$longitude)
-
Plot the raster tile for Mean Temperature of the Coldest Quarter (refer to bioclim variable list linked above). Add the sites to the map.
-
Which site has the lowest mean temperature of the coldest quarter?
-
-- Get weather --
Weather is one of the most frequently desired covariates for any environmental analysis. The National Oceanic and Atmospheric Administration has a history of high quality, high frequency, high resolution datasets that are available to the public.
The National Climatic Data Center provides an API connection to historical weather.
The rOpenSci team has created an R package that queries this API directly, called
rnoaa
.- Get a key from NOAA NCDC - this is necessary to use the API package.
- go to https://www.ncdc.noaa.gov/cdo-web/token
- enter your email
- your “token” or key will be sent to you.
- Install and load the package, and paste your token into an object
install.packages('rnoaa') library(rnoaa) ?rnoaa mytoken <- 'QrKTCPmNwJWXxFhpLHijBfXPIRShBAiv' # this is EL's token
This part takes a while, so the identifier for the weather station is provided for you. But if you want to explore available stations, you can download the whole set (~600K stations, takes a few minutes).
#library(data.table) #station_data <- ghcnd_stations() #setDT(station_data) #station_data[latitude > 44.9 & latitude < 45 & longitude > -93.2 & longitude < -93, .N, name] #station_data[name == 'UNIV OF MINN ST PAUL']
Once the weather station is identified, you can pull observations:
# the weather station USC00218450 is on the St Paul Campus out <- ncdc(datasetid='GHCND', stationid='GHCND:USC00218450', datatypeid='TMAX', startdate = '2017-03-01', enddate = '2017-03-31', limit=500, token = mytoken) out # examine results ncdc_plot(out) # built-in plot function (limited but quick)
Challenge problem:
-
Connect to the Bee Lab database (your local copy, or the remote hosted one)
-
Get the “collection_events” table from the database into an R data frame.
-
Using the St. Paul campus weather station (ID above), download the precipitation data for the dates inclusive of all the collection events.
HINT: use datatypeid = 'PRCP'
-
What was the rainiest date during the study period? How much rain fell (in mm)?
- Get a key from NOAA NCDC - this is necessary to use the API package.
-- Check taxonomy --
The Plant List has a web API with an accompanying R package:
Taxonstand
. We will use the API to check a list of plant names for accuracy and synonomy.First, get necessary packages:
install.packages('Taxonstand') library(Taxonstand) library(tidyverse) library(RMariaDB)
We will use the
specimens
table from the Bee Lab dataset:# make your connection to local or remote db as 'con' specimens <- dbReadTable(con, "specimens") ### Small Table so we can disconnect ## dbDisconnect(con)
Creating a “genus-species” column helps with merging
specimens$gs <- paste(specimens$plant.genus, specimens$plant.species)
Now we get unique values for plant names in this dataset. What are some potential problems that you see here?
plants<-unique(specimens[c("plant.genus", "plant.species")]) plants
Now we can get data from The Plant List. The ‘TPL’ function does this. Note that it can take awhile with a lot of names…
sp.out <- as.data.frame(TPL(genus=plants$plant.genus, species=plants$plant.species)) sp.out
first let’s find things that do not match
out1<-filter(sp.out, Plant.Name.Index==FALSE) %>% select(Taxon, New.Genus, New.Species) out1
We can update plant names.
out1$New.Genus <- ifelse(out1$New.Genus == "common", "Helianthus","null") out1$New.Species <- ifelse(out1$New.Species == "sunflower", "annuus","null")
Now we have a data frame with Taxon (old name) and upated new names We want to keep this old name for joining with our data set at the end.
We can also look for typos
out2<-filter(sp.out, Typo==TRUE)%>% select(Taxon, New.Genus, New.Species) out2
Challenge problem:
-
Find any synonyms returned from the Plant List for the specimen names.
-
Join the original specimens table to the Plant List taxonomy.
-
How many bees were observed on Compositae flowers?
-