Data Collection and QC

Learning Objectives

Following this assignment students should be able to:

- use version control to keep track of changes to code
- collaborate with someone else via a remote repository
- create a script to find and flag data points to check

Reading

Lecture Notes

Topics
- git in RStudio
- scripting QA / QC

RStudio & Git
- Version control idea & basics
- overall structure (remote & local; branches)
- add
- diff
- commit
  - writing commit messages
- revert
- push
- pull
- clone
- .gitignore
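
Each operation in the list above has a command-line counterpart, which can help when reading what the RStudio Git tab is doing. The following is a minimal sketch, assuming git is installed and run in a throwaway directory. The file names (`read_data.R`, `scratch.tmp`) and the identity settings are placeholders, not part of the course materials. The remote operations (push, pull, clone) need a remote repository and are left out here.

```shell
set -eu
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .
# Placeholder identity, stored only in this demo repository.
git config user.name "Demo Student"
git config user.email "student@example.com"

# .gitignore: list file patterns git should never track.
printf '*.tmp\n' > .gitignore
touch scratch.tmp
printf 'dat <- read.csv("data/birds.csv")\n' > read_data.R

# add + commit: stage the files, then record a snapshot with a message.
git add .gitignore read_data.R
git commit -q -m "Add data import script and ignore temporary files"

# diff: review unstaged edits before committing them.
printf 'str(dat)\n' >> read_data.R
git --no-pager diff
git add read_data.R
git commit -q -m "Print structure of imported data"

# revert (in the sense of RStudio's Revert button, as far as I know):
# discard uncommitted edits and restore the last committed version.
printf 'oops\n' >> read_data.R
git checkout -- read_data.R

git --no-pager log --oneline
```

Note that the command `git revert` does something different from discarding edits: it adds a new commit that undoes an earlier one.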

Scripting Data QC

Exercises

-- Set Up Git --

The University of Minnesota hosts an internal GitHub site. This site allows both private repositories (visible only to you and those you choose) and “public” repositories (visible to anyone with a github.umn.edu account).

To complete the following exercise, you must:
- Have git installed for your operating system following the setup instructions.
- Create an account on github.umn.edu using your UMN login information.

Create a new repository at github.umn.edu:
1. Navigate to github.umn.edu in a web browser and log in.
2. Click the + at the upper right corner of the page and choose New repository.
3. Fill in a Repository name that follows the form Lastname-ENT5920.
4. Select Private.
5. Select Initialize this repository with a README.
6. Click Create Repository.

Next, clone your new repository and set up a project in RStudio:
1. In RStudio, choose File -> New Project -> Version Control -> Git.
2. In your web browser, navigate to your new repository, click the Clone or download button, then click the Copy to clipboard button.
3. Paste the copied URL into Repository URL:.
4. Leave Project directory name: blank; it is filled in automatically with the repository name.
5. Choose a location under Create project as subdirectory of:.
6. Click Create Project.
7. Check to make sure you have a Git tab in the upper right window.
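
Behind the New Project dialog, RStudio runs a `git clone`. The sketch below shows the same step on the command line. So that it runs offline, a local bare repository (`remote.git`, a made-up name) stands in for the github.umn.edu remote; with the real remote you would pass the URL you copied from the browser instead. The identity settings are placeholders.

```shell
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-in for the remote: a bare repository holding one README commit,
# mimicking "Initialize this repository with a README".
git init -q --bare remote.git
git clone -q remote.git seed 2>/dev/null
cd seed
git config user.name "Demo Student"
git config user.email "student@example.com"
printf '# Lastname-ENT5920\n' > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin HEAD
cd "$sandbox"

# The step RStudio performs: clone the remote into a new project folder.
# With the real remote, replace remote.git with the copied repository URL.
git clone -q remote.git Lastname-ENT5920
cd Lastname-ENT5920
git remote -v    # shows where push and pull will go
git status       # a clean working tree suggests the clone worked
```

The Git tab appears in RStudio because the project folder contains a `.git` directory, which is what `git clone` creates.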

-- First Solo Commit --

This is a follow up to Set Up Git.

Copy the subdirectory folders you created as part of Week 2 into your new project folder. This should include a /data subdirectory with a csv file, and a /code subdirectory with the script you used to read in the data.

Using the RStudio Git tab, commit these changes to version control with a good commit message. Then check to see if you can see this commit in the history.

Finally, push your changes to your remote repository. Check to see whether the commit appears in your github.umn.edu repo.
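
For comparison, the same commit-and-push sequence can be sketched on the command line. This is an illustration under stated assumptions, not the required workflow: a local bare repository (`remote.git`) stands in for your github.umn.edu repo, and the file names and contents (`data/birds.csv`, `code/read_birds.R`) are placeholders for your Week 2 files.

```shell
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare remote.git              # offline stand-in for the remote
git clone -q remote.git project 2>/dev/null
cd project
git config user.name "Demo Student"
git config user.email "student@example.com"

# Placeholder versions of the Week 2 folders and files.
mkdir data code
printf 'WMA,count_observed\nExample,3\n' > data/birds.csv
printf 'birds <- read.csv("data/birds.csv")\n' > code/read_birds.R

git status --short                         # what the Git tab lists as new
git add data code                          # same as ticking the Staged boxes
git commit -q -m "Add Week 2 bird data and import script"
git --no-pager log --oneline               # the History view
git push -q origin HEAD                    # the Push button

# Confirm the commit reached the remote.
git --no-pager --git-dir="$sandbox/remote.git" log --oneline --all
```

The commit message names what was added and why it matters, which is the kind of message that stays useful when you read the history later.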

-- Git with a Partner --

This is a follow up to First Solo Commit.

Find a partner in the class; you will do some reciprocal collaboration on your new repositories.
1. Add your partner as a collaborator on your github.umn.edu repo.
  - In your repo, click “Settings”.
  - Choose “Collaborators” from the left side panel.
  - Search for and add your partner by their UMN ID.
2. Find your partner’s repo on github.umn.edu.
  - Search for the owner by UMN ID, for example “user:elind”.
  - Copy the repo address.
3. In RStudio, click New Project and follow the steps to create a local repo.
  - When prompted, enter your partner’s repo address.
4. Confirm that you have a local copy of your partner’s repo.
5. In your partner’s repo, add a comment to their data processing script.
6. Commit the change with an informative message.
7. Push the change to your partner’s remote repo.
8. Switch back to the RStudio Project you created first.
9. Pull to get your partner’s changes from the remote repo.
10. In the ‘Git’ tab, open History to see how your partner modified the file.
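
The round trip in the steps above can be sketched on the command line with two local clones of one shared repository. Everything here is an assumption made for illustration: the bare repository `partner-repo.git` stands in for the remote on github.umn.edu, the `owner` and `collaborator` folders stand in for the two partners' computers, and the script name and comment text are invented.

```shell
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare partner-repo.git        # offline stand-in for the shared remote

# Owner's copy, with one script already committed and pushed.
git clone -q partner-repo.git owner 2>/dev/null
cd owner
git config user.name "Repo Owner"
git config user.email "owner@example.com"
mkdir code
printf 'birds <- read.csv("data/birds.csv")\n' > code/read_birds.R
git add code
git commit -q -m "Add import script"
git push -q -u origin HEAD                 # -u records where later pulls come from
cd "$sandbox"

# Collaborator clones, adds a comment, commits, and pushes (steps 3 to 7).
git clone -q partner-repo.git collaborator
cd collaborator
git config user.name "Collaborator"
git config user.email "collaborator@example.com"
printf '# Consider checking column types after import\n' >> code/read_birds.R
git commit -q -a -m "Suggest a column type check in import script"
git push -q origin HEAD
cd "$sandbox/owner"

# Owner pulls and inspects the change (steps 8 to 10).
git pull -q --ff-only
git --no-pager log -p -1                   # last commit with its diff, like History
```

The push in step 7 only succeeds on the real server once the owner has added the partner as a collaborator; the local stand-in has no such permission check.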

-- Scripting Data QC --

N.B. The lecture notes for the Scripting QC portion contain most of the piecewise code needed to build this exercise.

Say you are setting up a network of bird surveyors from around the state of Minnesota. As data coordinator, you will be receiving files from all over the state, and be expected to produce a clean, consistent dataset from a multitude of submitted observations.

Using the cleaned ‘WMA-bird’ dataset as a model, you decide that the following column names and types should be standard:

Column name	Type
WMA	character
date_sampled	date (YYYY-MM-DD)
latin_name	character
count_observed	integer

You ask each surveyor to at least export their data from Excel as a csv file. Write a script that reads in each file, then reports:

- whether the column names conform to the standard
- whether the column types conform to the standard

The script should output a list of column names for the file, whether they match the standard names, and the type of data according to your input procedure.

Write a function that will act on the count_observed column in the standard data. The function should return:

- a Cleveland dotplot of the count values
- a table of the individual counts (with location and species) that fall outside the central 95% of the sample distribution

Data Management for Biologists
