Learning Objectives
Following this session students should be able to:
- describe the value of metadata as distinct from data
- give a real-world example of metadata use
- write basic metadata to describe a dataset
- find resources for more specific metadata standards by subject area
Reading
Meta-information concepts for ecological data management
Lecture Notes
- Guest lecture: MN DNR data governance practices
Exercises
-- CSV Header Metadata --
Download the code for this exercise
To ensure that the bare minimum of metadata is conveyed with the data, it can be useful to include the metadata as a header object at the top of a data file. This forces the end user to confront the metadata before the data; ignorance of the structure and purpose of the data from that point on is an affirmative choice rather than the default condition.
The process of using R to add metadata to an exported data file is surprisingly straightforward. As an example we will create a short dataset, then attach metadata to the exported file.
The process relies on the cat and writeLines functions, which are base R functions.
Create some sandbox data
out <- data.frame(plot = 1:20,
                  trt = sample(LETTERS[1:4], 20, replace = T),
                  sampled = rep(seq.Date(as.Date('2018-02-02'), as.Date('2018-02-06'), length.out = 5), 4),
                  response = rnorm(n = 20, mean = 50, sd = 20))
str(out)
Side note: easily generating fake data to play with is one of the best learning features of R.
Create a header and short description
header1 <- c('METADATA',
             'This file contains measurements of experimental plots at Research Garden in February 2018')
cat(header1, sep = '\n')
When you cat a statement, it renders the elements without the vector indexing or the quotation marks bracketing them, separated by whatever you supply to the sep argument.
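For comparison, a quick illustration of the difference, using the header1 vector created above:
print(header1)            # prints the vector with element indices and surrounding quotes
cat(header1, sep = '\n')  # writes the bare strings, one per line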
Create detailed column descriptions (“data dictionary”)
Here we take advantage of the fact that R assigns each object a class from a fixed set of types. These are not necessarily the same as the types allowed in other schemas - there are many standards to choose from! - but they map reasonably well onto any sensible division of information types in most data.
names(out)
unlist(lapply(out, class))
# the next bit combines the names, types, and an additional description column
header2 <- data.frame(COLUMN = names(out),
                      TYPE = unlist(lapply(out, class)),
                      DEFINITION = c('plot identifier',
                                     'treatment code; one of (A = control), (B = +5 mg PO4), (C = +5 mg NH4NO3), (D = +5 mg PO4 and +5 mg NH4NO3)',
                                     'date sampled',
                                     'dried plant mass (g)'))
Write each header, then the data, appending each piece to an open file connection
The following process takes advantage of the writeLines function to append the data after the headers have been written.
datestamp <- Sys.Date()  # useful to indicate data export date
datafile <- file(paste0('FebBiomass-', datestamp, '.csv'), open = 'wt')
writeLines(header1, con = datafile)
write.csv(header2, datafile, row.names = F)
writeLines('DATA', con = datafile)
write.csv(out, datafile, row.names = F)
close(datafile)
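As a quick check (a minimal sketch, not part of the original exercise), the exported file can be read back in with base R by locating the DATA marker line and skipping everything above it:
fname <- paste0('FebBiomass-', datestamp, '.csv')
raw <- readLines(fname)
data_start <- which(raw == 'DATA')         # position of the marker line written above
dat <- read.csv(fname, skip = data_start)  # the column-name row follows the marker
str(dat)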
-- JSON Metadata --
Download the code for this exercise
- Access the open standard for camera trap data and download Supplementary Material 3, the JSON template. Save the file to your project directory.
- Install and explore the jsonlite package. There is much more about the relationship between data representation in R and JSON in the accompanying paper.
install.packages('jsonlite')
library(jsonlite)
?jsonlite
- Read in the JSON template you downloaded.
## read in a json schema ####
template <- fromJSON('oo_99336.json')
str(template)
template$CameraTrapMetadataStandard
- From the JSON organization and the open standard paper, the
data structure can be recreated. Existing records from a table
of images and identifications can be re-arranged and placed into
a list to mimic the JSON schema.
## combining datasets to match schema definitions ####
myImages <- list(
  Project = data.frame(ProjectID = 'thelindyardproject.org',
                       ProjectName = 'Lind backyard Summer 2017',
                       ProjectObjectives = 'Find out who has been eating my lettuces'),
  Deployment = data.frame(CameraDeploymentID = c(1:3),
                          CameraDeploymentBeginDate = c('2017-06-01', '2017-06-07', '2017-06-10'),
                          DeploymentLocationID = 'Shed wall'),
  Images = data.frame(imageID = 1:10,
                      dateTimeCaptured = sample(seq.Date(from = as.Date('2017-06-01'),
                                                         to = as.Date('2017-07-01'), by = 1),
                                                10, replace = T),
                      photoType = c('Staff', rep('Animal', 4), 'Unidentifiable', rep('Animal', 4)),
                      photoTypeIdentifiedBy = 'Eric Lind'),
  ImageAnimal = data.frame(imageID = c(rep(2, 2), 3, 4, 6:10),
                           imageCount = 1,
                           speciesScientificName = c('Vulpes vulpes', rep('Sylvilagus floridanus', 8)))
)
myImages
- Once the structures are named and nested correctly, the export
to shareable, findable JSON data is straightforward:
toJSON(myImages, pretty = T)
toJSON(template, pretty = T)
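As an optional sketch beyond the exercise, the JSON can also be written to disk for sharing: the toJSON() string can be saved with writeLines(), or recent versions of jsonlite provide write_json() to write directly to a path. The file name myImages.json is just a placeholder.
writeLines(toJSON(myImages, pretty = TRUE), 'myImages.json')
# equivalently, in one step:
write_json(myImages, 'myImages.json', pretty = TRUE)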
-- Homework 1 --
Clean & Document dataset
Homework should be committed to your github.umn.edu repository by 4:30 Monday Feb 19.
You will work in the named repository (ENT5920-Lastname) you created in Week 3.
Using a dataset of your own, preferably one you will be using in your research, clean and document the data as follows:
- Choose a limited dataset, 5 - 10 columns of data. Number of rows is less important except as it impedes your ability to post the data to github.
- Create an R script which does the following (a minimal sketch follows this list):
- reads the data in, using relative (not hard-coded) paths
- gives a summary table of the number of identifiers
- places, dates, species if applicable
- produces numerical summaries of variables
- ranges, central tendencies, quartiles
- histograms of values
- correlation among variables
- cleans any errors found, preferably with a generic function
- outputs a clean dataset to a new location with datestamp
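A minimal sketch of such a QC script, assuming a hypothetical file data/my_data.csv with placeholder columns site and response; adapt the paths, column names, and cleaning rules to your own dataset:
# read the data in with a relative path
dat <- read.csv('data/my_data.csv')

# summary table of identifiers (places, dates, species, etc.)
table(dat$site)

# numerical summaries: ranges, central tendencies, quartiles
summary(dat)

# distributions and correlations among numeric variables
hist(dat$response)
cor(dat[sapply(dat, is.numeric)], use = 'pairwise.complete.obs')

# clean errors found (example: impossible negative values set to NA)
dat$response[dat$response < 0] <- NA

# output a clean dataset to a new location with a datestamp
write.csv(dat, paste0('clean/my_data_clean_', Sys.Date(), '.csv'), row.names = FALSE)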
- Create a metadata record for this dataset. The format of
the metadata document should follow a standard for your
field, if possible. If not, associate the metadata with
the data by adding it to the header of the data document. The metadata record
should include:
- The project and investigator(s) responsible for the data
- The geographic and temporal extent of the data
- a short (1 paragraph) abstract of the purpose & methods
- a data dictionary with column definitions
- Commit the data, QC script, and metadata documents to your local git repository.
- Push the changes to your remote (github.umn.edu) repository.
- Ensure both Dan (dcarivea) and Eric (elind) are added as collaborators to your repository.