Learning Objectives
Following this assignment students should be able to:
- use version control to keep track of changes to code
- collaborate with someone else via a remote repository
- create a script to find and flag data points to check
Reading
Lecture Notes
- Topics
  - git in RStudio
  - scripting QA / QC
- R Studio & Git
  - Version control idea & basics
  - overall structure (remote & local; branches)
  - add
  - diff
  - commit
  - writing commit messages
  - revert
  - push
  - pull
  - clone
  - .gitignore
- Scripting Data QC
Exercises
-- Set Up Git --
The University of Minnesota hosts an internal GitHub site. This site allows both private repositories (visible only to you and those you choose) and “public” repositories (visible to everyone with a github.umn.edu account).
To complete the following exercise, you must:
- Have git installed for your operating system following the setup instructions.
- Create an account on github.umn.edu using your UMN login information.
Create a new repository at github.umn.edu:
- Navigate to github.umn.edu in a web browser and log in.
- Click the “+” at the upper right corner of the page and choose “New repository”.
- Fill in a “Repository name” that follows the form Lastname-ENT5920.
- Select “Private”.
- Select “Initialize this repository with a README”.
- Click “Create Repository”.
Next, clone your new repository and set up a project in RStudio:
- File -> New Project -> Version Control -> Git
- Navigate to your new Git repo -> Click the “Clone or download” button -> Click the “Copy to clipboard” button.
- Paste this in “Repository URL:”.
- Leave “Project directory name:” blank; it is automatically given the repo name.
- Choose where to “Create project as subdirectory of:”.
- Click “Create Project”.
- Check to make sure you have a “Git” tab in the upper right window.
-- First Solo Commit --
This is a follow up to Set Up Git.
Copy the subdirectory folders you created as part of Week 2. This should include a /data subdirectory with a csv file, and a /code subdirectory with the script you used to read in the data.
Using the RStudio Git tab, commit these changes to version control with a good commit message. Then check to see if you can see this commit in the history.
Finally, push your changes to your remote repository. Check to see whether the commit appears in your github.umn.edu repo.
-- Git with a Partner --
This is a follow up to First Solo Commit.
Find a partner in the class. You will do some reciprocal collaboration on your new repositories.
- Add your partner as a collaborator on your github.umn.edu repo
  - In your repo, click “Settings”
  - Choose “Collaborators” from the left side panel
  - Search for and add your partner by their UMN ID
- Find your partner’s repo on github.umn.edu
  - Search for the owner by UMN ID as follows: “user:elind”
  - Copy the repo address
- In RStudio, click New Project and follow the steps to create a local repo
  - When prompted, enter your partner’s repo address
- Confirm you have pulled your partner’s repo
- In your partner’s repo, add a comment to their data processing script
- Commit the change with an informative message
- Push the change to your partner’s remote repo
- Switch back to the first RStudio Project you created
- Pull to get the changes from the online repo
  - In the “Git” tab, open History to see how your partner modified the file
-- Scripting Data QC --
N.B. The lecture notes for the Scripting QC portion contain most of the piecewise code needed to build this exercise.
Say you are setting up a network of bird surveyors from around the state of Minnesota. As data coordinator, you will be receiving files from all over the state, and be expected to produce a clean, consistent dataset from a multitude of submitted observations.
Using the cleaned ‘WMA-bird’ dataset as a model, you decide that the following column names and types should be standard:
Column name       Type
WMA               character
date_sampled      date (YYYY-MM-DD)
latin_name        character
count_observed    integer

- You ask each surveyor to at least make the effort to export their data from Excel as a csv. Write a script that will read in each file, then report:
  - whether the column names conform to the standard
  - whether the types conform to the standard
The script should output a list of column names for the file, whether they match the standard names, and the type of data according to your input procedure.
- Write a function that will act on the count_observed column in the standard data. The function should return:
  - a Cleveland dotplot of the count values
  - a table of individual counts (where and of what species) which are outside the 95% central density of the sample.
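The two tasks above could be sketched in base R roughly as follows. This is only a starting point under stated assumptions, not the lecture-note solution. The names standard, detect_type(), check_file() and flag_counts() are made up for illustration, and the example data are invented. "Outside the 95% central density" is assumed here to mean below the 2.5th or above the 97.5th percentile of the counts. Because base read.csv() reads dates as text, the sketch reports a text column as Date when every value parses as YYYY-MM-DD.

```r
# Standard column names and types, taken from the table in this exercise.
standard <- c(WMA = "character", date_sampled = "Date",
              latin_name = "character", count_observed = "integer")

# Type of one column as read in. Assumption: a character column counts as
# "Date" only if every value parses as YYYY-MM-DD.
detect_type <- function(x) {
  if (is.character(x) && length(x) > 0) {
    parsed <- as.Date(x, format = "%Y-%m-%d")
    if (!anyNA(parsed)) return("Date")
  }
  class(x)[1]
}

# Read one submitted csv and report, per column: its name, whether the name
# is in the standard, the detected type, and whether that type matches.
check_file <- function(path, standard) {
  dat <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE)
  found <- unname(vapply(dat, detect_type, character(1)))
  expected <- unname(standard[names(dat)])  # NA for non-standard names
  data.frame(
    column  = names(dat),
    name_ok = names(dat) %in% names(standard),
    type    = found,
    type_ok = !is.na(expected) & found == expected,
    stringsAsFactors = FALSE
  )
}

# Draw a Cleveland dotplot of count_observed (as a side effect) and return
# the rows whose counts fall outside the central 95% of the sample.
flag_counts <- function(dat) {
  counts <- dat$count_observed
  dotchart(sort(counts), xlab = "count_observed",
           main = "Counts, sorted (Cleveland dotplot)")
  limits <- quantile(counts, probs = c(0.025, 0.975), na.rm = TRUE)
  outside <- !is.na(counts) & (counts < limits[1] | counts > limits[2])
  dat[outside, c("WMA", "latin_name", "count_observed")]
}

# Invented example submission: 40 counts of 5 and one suspicious count of 200.
submission <- data.frame(
  WMA            = rep(c("Site A", "Site B"), length.out = 41),
  date_sampled   = "2020-06-01",
  latin_name     = "Species a",
  count_observed = c(rep(5L, 40), 200L),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(submission, path, row.names = FALSE)

report <- check_file(path, standard)
print(report)

flagged <- flag_counts(read.csv(path, stringsAsFactors = FALSE))
print(flagged)
```

check_file() returns one row per column, so reports from many files can be stacked with rbind() to compare surveyors. Percentile cut-offs tend to flag the most extreme values even in clean data, so the returned table is a list of points to check by hand, not a list of errors.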