Learning Objectives
Following this session students should be able to:
- describe the value of metadata as distinct from data
- give a real-world example of metadata use
- write basic metadata to describe a dataset
- find resources for more specific metadata standards by subject area
Reading
Meta-information concepts for ecological data management
Lecture Notes
- Guest lecture: MN DNR data governance practices
Exercises
-- CSV Header Metadata --
Download the code for this exercise
To ensure that the bare minimum of metadata is conveyed with the data, it can be useful to include the metadata as a header object at the top of a data file. This forces the end user to confront the metadata before the data; ignorance of the structure and purpose of the data from that point on is an affirmative choice rather than the default condition.
The process of using R to add metadata to an exported data file is surprisingly straightforward. As an example we will create a short dataset, then attach metadata to the exported file.
The process relies on the cat and writeLines functions, which are base R functions.
Create some sandbox data
out <- data.frame(plot = 1:20,
                  trt = sample(LETTERS[1:4], 20, replace = T),
                  sampled = rep(seq.Date(as.Date('2018-02-02'), as.Date('2018-02-06'), length.out = 5), 4),
                  response = rnorm(n = 20, mean = 50, sd = 20))
str(out)
Side note: easily generating fake data to play with is one of the best learning features of R.
Create a header and short description
header1 <- c('METADATA',
             'This file contains measurements of experimental plots at Research Garden in February 2018')
cat(header1, sep = '\n')
When you cat a statement, it renders the elements without the vector indexing or the quotation marks bracketing them, separated by whatever you supply to the sep argument.
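For comparison, a quick illustration of the difference, using the header1 vector created above:
print(header1)            # prints the vector with element indices and surrounding quotes
cat(header1, sep = '\n')  # writes the bare strings, one per line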
Create detailed column descriptions (“data dictionary”)
Here we take advantage of the fact that R assigns each object a class from a fixed set of types. These are not necessarily the same as the types allowed in other schemas - there are many standards to choose from! - but they map reasonably well onto any sensible division of information types in most data.
names(out)
unlist(lapply(out, class))
# the next bit combines the names, types, and an additional description column
header2 <- data.frame(COLUMN = names(out),
                      TYPE = unlist(lapply(out, class)),
                      DEFINITION = c('plot identifier',
                                     'treatment code; one of (A = control), (B = +5 mg PO4), (C = +5 mg NH4NO3), (D = +5 mg PO4 and +5 mg NH4NO3)',
                                     'date sampled',
                                     'dried plant mass (g)'))
Write each header, then the data, appending each piece to an open file connection
The following process takes advantage of the writeLines function to append the data after the headers have been written.
datestamp <- Sys.Date()  # useful to indicate data export date
datafile <- file(paste0('FebBiomass-', datestamp, '.csv'), open = 'wt')
writeLines(header1, con = datafile)
write.csv(header2, datafile, row.names = F)
writeLines('DATA', con = datafile)
write.csv(out, datafile, row.names = F)
close(datafile)
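As a quick check (a minimal sketch, not part of the original exercise), the exported file can be read back in with base R by locating the DATA marker line and skipping everything above it:
fname <- paste0('FebBiomass-', datestamp, '.csv')
raw <- readLines(fname)
data_start <- which(raw == 'DATA')         # position of the marker line written above
dat <- read.csv(fname, skip = data_start)  # the column-name row follows the marker
str(dat)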
-- JSON Metadata --
Download the code for this exercise
- Access the open standard for camera trap data and download Supplementary Material 3, the JSON template. Save the file to your project directory.
- Install and explore the jsonlite package. There is much more about the relationship between data representation in R and JSON in the accompanying paper.
install.packages('jsonlite')
library(jsonlite)
?jsonlite
- Read in the JSON template you downloaded.
## read in a json schema ####
template <- fromJSON('oo_99336.json')
str(template)
template$CameraTrapMetadataStandard
- From the JSON organization and the open standard paper, the
data structure can be recreated. Existing records from a table
of images and identifications can be re-arranged and placed into
a list to mimic the JSON schema.
## combining datasets to match schema definitions ####
myImages <- list(
  Project = data.frame(ProjectID = 'thelindyardproject.org',
                       ProjectName = 'Lind backyard Summer 2017',
                       ProjectObjectives = 'Find out who has been eating my lettuces'),
  Deployment = data.frame(CameraDeploymentID = c(1:3),
                          CameraDeploymentBeginDate = c('2017-06-01', '2017-06-07', '2017-06-10'),
                          DeploymentLocationID = 'Shed wall'),
  Images = data.frame(imageID = 1:10,
                      dateTimeCaptured = sample(seq.Date(from = as.Date('2017-06-01'),
                                                         to = as.Date('2017-07-01'), by = 1),
                                                10, replace = T),
                      photoType = c('Staff', rep('Animal', 4), 'Unidentifiable', rep('Animal', 4)),
                      photoTypeIdentifiedBy = 'Eric Lind'),
  ImageAnimal = data.frame(imageID = c(rep(2, 2), 3, 4, 6:10),
                           imageCount = 1,
                           speciesScientificName = c('Vulpes vulpes', rep('Sylvilagus floridanus', 8)))
)
myImages
- Once the structures are named and nested correctly, the export
to shareable, findable JSON data is straightforward:
toJSON(myImages, pretty = T)
toJSON(template, pretty = T)
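As an optional sketch beyond the exercise, the JSON can also be written to disk for sharing: the toJSON() string can be saved with writeLines(), or recent versions of jsonlite provide write_json() to write directly to a path. The file name myImages.json is just a placeholder.
writeLines(toJSON(myImages, pretty = TRUE), 'myImages.json')
# equivalently, in one step:
write_json(myImages, 'myImages.json', pretty = TRUE)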
-- Homework 1 --
Clean & Document dataset
Homework should be committed to your github.umn.edu repository by 4:30 Monday Feb 19.
You will work in the named repository (ENT5920-Lastname) you created in Week 3.
Using a dataset of your own, preferably one you will be using in your research, clean and document the data as follows:
- Choose a limited dataset, 5 - 10 columns of data. Number of rows is less important except as it impedes your ability to post the data to github.
- Create an R script which does the following (a minimal sketch follows this list):
- reads the data in, using relative (not hard-coded) paths
- gives a summary table of the number of identifiers
- places, dates, species if applicable
- produces numerical summaries of variables
- ranges, central tendencies, quartiles
- histograms of values
- correlation among variables
- cleans any errors found, preferably with a generic function
- outputs a clean dataset to a new location with datestamp
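A minimal sketch of such a QC script, assuming a hypothetical file data/my_data.csv with placeholder columns site and response; adapt the paths, column names, and cleaning rules to your own dataset:
# read the data in with a relative path
dat <- read.csv('data/my_data.csv')

# summary table of identifiers (places, dates, species, etc.)
table(dat$site)

# numerical summaries: ranges, central tendencies, quartiles
summary(dat)

# distributions and correlations among numeric variables
hist(dat$response)
cor(dat[sapply(dat, is.numeric)], use = 'pairwise.complete.obs')

# clean errors found (example: impossible negative values set to NA)
dat$response[dat$response < 0] <- NA

# output a clean dataset to a new location with a datestamp
write.csv(dat, paste0('clean/my_data_clean_', Sys.Date(), '.csv'), row.names = FALSE)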
- Create a metadata record for this dataset. The format of
the metadata document should follow a standard for your
field, if possible. If not, associate the metadata with
the data by adding it to the header of the data document. The metadata record
should include:
- The project and investigator(s) responsible for the data
- The geographic and temporal extent of the data
- a short (1 paragraph) abstract of the purpose & methods
- a data dictionary with column definitions
- Commit the data, QC script, and metadata documents to your local git repository.
- Push the changes to your remote (github.umn.edu) repository.
- Ensure both Dan (dcarivea) and Eric (elind) are added as collaborators to your repository.