APIs and data integration

Learning Objectives

Following this assignment students should be able to:

understand how to find data to supplement existing datasets

understand API interfaces in R

pull supplemental data into datasets in R

merge supplemental data into datasets

Reading

Topics
- Finding ancillary & supplemental data
- Reference datasets for taxonomic checking
- API packages in R
- Merging with existing data
  - join types
  - roll joins
  - interval overlap joins
~~Readings~~ Browsings

Lecture Notes

Schema assignment review & discussion

APIs & Data Integration

Introduction

API R packages

Joining data

Exercises

-- Supplemental Data Discussion --

Pairwise / Group Discussion:
1. Think about the research datasets you have (or will have). What else would you like to know about your species, your locations, your dates, your observations? Write down a few examples.
2. Where you would obtain the data you would like to have?
3. How would you join the data you obtain, to the data you already have? Try to be as specific as you can, thinking in terms of actual columns.

-- Worldclim data --

The worldclim dataset can be useful in distinguishing sites in the recent historic period - the “stationary” means across the thirty years from 1970 - 2000.

Install necessary packages

# raster package is an R interface to some GIS functionality
install.packages('raster')
library(raster) 

Get worldclim data

## there are multiple options for worldclim - see ?getData
# for bioclim at highest resolutions, a lat/long is specified

bioclim <- getData("worldclim", var='bio', res=0.5, 
		lon=-93.185, lat=44.985)
bioclim

The data downloaded is a set of raster tiles. There is one tile for each of the 19 bioclim variables (see http://worldclim.org/bioclim)

When multiple raster tiles are combined in the same geographic location, it can be organized into a raster stack.

You can plot the raster values:

bioclim[[1]] # the mean annual temperature raster tile
plot(bioclim[[1]])

# add reference lat-long pair
points(x = -93.185, y = 44.985, cex = 2, pch = 16) 

plot(bioclim[[12]]) # mean annual precipitation in mm

# add reference lat-long pair
points(x = -93.185, y = 44.985, cex = 2, pch = 16) 

But what you are mostly interested in, are the values themselves at your lat-long of interest. The extract function in the raster package does that.

# make a lat-long data frame
loc <- data.frame(site = 'UMN-StP', 
		longitude = -93.185, latitude = 44.985)

# overlay the lat-long data frame on the entire stack 
extract(bioclim, loc)

Challenge problem:

Connect to the Bee Lab database (your local copy, or the remote hosted one)

Create an R table that has the six sites with latitude and longitude:

 # where 'con' is your connection definition:
 loc2 <- dbGetQuery(con, 
             "select siteID, longitude, latitude from sites;")

 # because of db field definition (VARCHAR) must turn lat/lon to numeric
 loc2$latitude <- as.numeric(loc2$latitude)
 loc2$longitude <- as.numeric(loc2$longitude)

Plot the raster tile for Mean Temperature of the Coldest Quarter (refer to bioclim variable list linked above). Add the sites to the map.
Which site has the lowest mean temperature of the coldest quarter?

[click here for output]

-- Get weather --

Weather is one of the most frequently desired covariates for any environmental analysis. The National Oceanic and Atmospheric Administration has a history of high quality, high frequency, high resolution datasets that are available to the public.

The National Climatic Data Center provides an API connection to historical weather.

The rOpenSci team has created an R package that queries this API directly, called rnoaa.
1. Get a key from NOAA NCDC - this is necessary to use the API package.
  - go to https://www.ncdc.noaa.gov/cdo-web/token
  - enter your email
  - your “token” or key will be sent to you.
2. Install and load the package, and paste your token into an object
```
install.packages('rnoaa')
library(rnoaa)
?rnoaa

mytoken <- 'QrKTCPmNwJWXxFhpLHijBfXPIRShBAiv'
# this is EL's token
```
This part takes a while, so the identifier for the weather station is provided for you. But if you want to explore available stations, you can download the whole set (~600K stations, takes a few minutes).
```
#library(data.table)
#station_data <- ghcnd_stations()
#setDT(station_data)
#station_data[latitude > 44.9 & latitude < 45 & longitude > -93.2 & longitude < -93,
             .N, name]
#station_data[name == 'UNIV OF MINN ST PAUL']
```
Once the weather station is identified, you can pull observations:
```
# the weather station USC00218450 is on the St Paul Campus
out <- ncdc(datasetid='GHCND',
 stationid='GHCND:USC00218450',
 datatypeid='TMAX',
 startdate = '2017-03-01',
 enddate = '2017-03-31',
 limit=500, token = mytoken)

out # examine results 
ncdc_plot(out) # built-in plot function (limited but quick)
```
Challenge problem:
1. Connect to the Bee Lab database (your local copy, or the remote hosted one)
2. Get the “collection_events” table from the database into an R data frame.
3. Using the St. Paul campus weather station (ID above), download the precipitation data for the dates inclusive of all the collection events.
  HINT: use datatypeid = 'PRCP'
4. What was the rainiest date during the study period? How much rain fell (in mm)?
[click here for output]
-- Check taxonomy --

The Plant List has a web API with an accompanying R package: Taxonstand. We will use the API to check a list of plant names for accuracy and synonomy.

First, get necessary packages:
```
install.packages('Taxonstand')
library(Taxonstand)
library(tidyverse)
library(RMariaDB) 
```
We will use the specimens table from the Bee Lab dataset:
```
# make your connection to local or remote db as 'con'
specimens <- dbReadTable(con, "specimens")

### Small Table so we can disconnect
##
dbDisconnect(con)
```
Creating a “genus-species” column helps with merging
```
specimens$gs <- paste(specimens$plant.genus, specimens$plant.species)
```
Now we get unique values for plant names in this dataset. What are some potential problems that you see here?
```
plants<-unique(specimens[c("plant.genus", "plant.species")])
plants
```
Now we can get data from The Plant List. The ‘TPL’ function does this. Note that it can take awhile with a lot of names…
```
sp.out <- as.data.frame(TPL(genus=plants$plant.genus, species=plants$plant.species))
sp.out
```
first let’s find things that do not match
```
out1<-filter(sp.out, Plant.Name.Index==FALSE) %>% select(Taxon, New.Genus, New.Species)
out1
```
We can update plant names.
```
out1$New.Genus <- ifelse(out1$New.Genus == "common", "Helianthus","null")
out1$New.Species <- ifelse(out1$New.Species == "sunflower", "annuus","null")
```
Now we have a data frame with Taxon (old name) and upated new names We want to keep this old name for joining with our data set at the end.

We can also look for typos
```
out2<-filter(sp.out, Typo==TRUE)%>% select(Taxon, New.Genus, New.Species)
out2
```
Challenge problem:
1. Find any synonyms returned from the Plant List for the specimen names.
2. Join the original specimens table to the Plant List taxonomy.
3. How many bees were observed on Compositae flowers?
[click here for output]

Assignment

Learning Objectives

Reading

Lecture Notes

Schema assignment review & discussion

APIs & Data Integration

Exercises

Pairwise / Group Discussion:

Challenge problem:

Challenge problem:

Challenge problem: