Learning Objectives

After completing this assignment, students should be able to:

  • explain why coding is advantageous for data management
  • format R code for readability and clarity
  • add comments and breaks to R code
  • call R scripts from within R scripts (sourcing)
  • write simple custom R functions
  • organize an R project and workspace
  • connect to git, and perform commit, push, and pull

Reading

Lecture Notes

  1. Why code? Why R?

  2. (Re-)Introduction to R & RStudio

  3. Principles and Style Elements of Code Style

  4. Functionalizing & Sourcing

  5. Directory Organization & Workflow in RStudio

  6. RStudio & Git


Exercises

  1. -- Install packages --

    The existence of packages written specifically for performing certain tasks is one of the things that makes R such a great scripting platform for data management, analysis, and visualization.

    There are recommended sets of packages whose functions are built for common research areas in biological computing, such as spatial data analysis, ecological and environmental analysis, and genetics.

    Some packages are interrelated and share a common style and language. The tidyverse is one well-known example, aimed at data management and statistics (“data science”).

    CRAN (the Comprehensive R Archive Network) hosts versions of packages that have been checked to work across major platforms under the specified versions of R. Contributors who want to share their work without undergoing CRAN review offer their packages elsewhere online (e.g. on GitHub).

    We will be exploring different approaches to managing and visualizing data in R. These are some of the packages that will be the workhorses of our methods.

    1. Install the packages “ggplot2” and “RMariaDB” using the Install button on the “Packages” tab of the RStudio panel. As the installation proceeds, watch the Console for the function call and response.

    2. Using what you observed in the console, install the package “data.table” from the command line (console or script window).

    3. Make all three packages active in the environment.

    4. Display the help for the function fread in the “data.table” package.
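The four steps above can be sketched at the console roughly as follows. This is one possible approach, not the only correct answer. The `interactive()` and `requireNamespace()` guards are assumptions added here so the sketch is safe to re-run; the exercise itself only needs `install.packages()`, `library()`, and `help()`.

```r
# Packages used in this exercise.
pkgs <- c("ggplot2", "RMariaDB", "data.table")

# Console equivalent of the Install button. The guard (an assumption, not
# part of the exercise) skips the download when the package is already
# installed or when R is not running interactively.
if (interactive() && !requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")
}

# Attach each installed package. library() stops with an error when a
# package is missing, so check first and report instead.
for (pkg in pkgs) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
  } else {
    message("Not installed yet: ", pkg)
  }
}

# Display the help page for fread(); ?fread works once data.table is attached.
if (interactive() && requireNamespace("data.table", quietly = TRUE)) {
  help("fread", package = "data.table")
}
```

At the console, the shorter forms `library(data.table)` and `?fread` do the same job once the packages are installed.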

  2. -- Create script --

    Because R is open-source software, you as the user must occasionally step in to keep your installation up to date.

    When upgrading to a new major release (e.g. R 3.3 to 3.4), the packages you have installed should be re-installed to get the latest compatible versions (and sometimes even for them to work). With some bundles of packages this is straightforward, but you will almost certainly need to download, install, and use packages from a variety of sources in your work.

    Here, you will create a utility script which you can call when you upgrade your R version, to download and install clean versions of your most common packages.

    1. Create a new R script and save it with a useful name related to its purpose.

    2. Using comments, give the purpose of the script and what it ought to return when it works correctly.

    3. Write the code that will install all three of the packages you just installed in the install packages exercise.
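A minimal sketch of such a utility script is shown here. The file name in the header comment and the `installed` status vector are illustrative choices, not requirements of the exercise, and the `interactive()` guard assumes the script is run from an RStudio session where a CRAN mirror is already configured.

```r
# reinstall-packages.R (hypothetical file name)
# Purpose: install fresh copies of commonly used packages after upgrading R.
# Expected result: every package listed in `pkgs` is installed, and the named
#   logical vector `installed` shows TRUE for each package that can be found.

pkgs <- c("ggplot2", "RMariaDB", "data.table")

# Install only in an interactive session (assumption: a CRAN mirror is
# configured for the session).
if (interactive()) {
  install.packages(pkgs)
}

# Report which packages are present in the library after the install step.
installed <- pkgs %in% rownames(installed.packages())
names(installed) <- pkgs
print(installed)
```

Keeping the package names in a single vector means that adding a package later requires a change in only one place.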

  3. -- Create function --

    Don’t Repeat Yourself

    If you will be performing actions more than once it is better to create modular code that does the task. The alternative, copy-pasting code and modifying it for the new target, is an error-generating process. It also contains a large inefficiency: when you want to change something about what the code does, you must change it as many times as you have repeated the statement.

    In contrast, a modular piece of code need only be changed once, and in one place, then re-run against the targets. In R, one option for modular code is to create a custom function. Functions take arguments, and return results, but other than that have nearly unlimited flexibility.
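As a small illustration of the difference, the sketch below converts two made-up temperature vectors from Fahrenheit to Celsius, first by copy-pasting the formula and then with a custom function. The data and the name `f_to_c` are hypothetical and used only for this example.

```r
# Hypothetical temperature readings in degrees Fahrenheit.
site_a_f <- c(68, 77, 86)
site_b_f <- c(50, 59)

# Copy-paste approach: the formula is repeated for each target, so any
# correction has to be made in every copy.
site_a_c <- (site_a_f - 32) * 5 / 9
site_b_c <- (site_b_f - 32) * 5 / 9

# Modular approach: the formula lives in one place.
f_to_c <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Re-run the same function against each target.
site_a_c_fn <- f_to_c(site_a_f)
site_b_c_fn <- f_to_c(site_b_f)
```

Both approaches give the same values here, but only the function version can be corrected or extended with a single edit.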

    1. R has some quirks. One is that, despite being written primarily as a language for statistics, it does not come with a built-in function to calculate the standard error of the mean of a sample. This statistic is useful in many descriptive statistical summaries.

    2. Create a new script and save it with a useful name that indicates it contains a function for calculating the SEM.

    3. Using comments, document what the function is meant to do, what inputs it uses, and what it is expected to return.

    4. The definition of the standard error of the mean is: the standard deviation of the observations, divided by the square root of the number of observations.
      • the function for calculating standard deviation is sd()
      • the function for taking the square root is sqrt()
      • the function for calculating the number of elements in a vector is length()
    5. Write a function which takes as its argument a vector of numbers x and returns the standard error of the mean of those numbers. Name the function se_m.

    6. Open a separate script. Make se_m() available in the environment by using the source() function and the name of the function definition script you created.

    7. Calculate the standard error of the mean for the sequence of numbers 1:20.

    Optional: modify your function to operate even when a vector of numbers includes an NA value.
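One way the function could look is sketched below, following the definition given in step 4. The `na.rm` argument is an assumption modeled on the argument of the same name in `sd()`; it is one possible way to handle the optional NA case, not the required one.

```r
# se_m: standard error of the mean.
# Input:   x      a numeric vector of observations
#          na.rm  if TRUE, drop NA values before calculating (assumed
#                 argument, named after the one in sd())
# Returns: a single number, sd(x) / sqrt(number of observations)
se_m <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  sd(x) / sqrt(length(x))
}

# Standard error of the mean for the sequence 1:20.
sem_1_20 <- se_m(1:20)
print(sem_1_20)

# With an NA present, the result is NA unless na.rm = TRUE.
sem_with_na <- se_m(c(1:20, NA))
sem_na_removed <- se_m(c(1:20, NA), na.rm = TRUE)
```

If the definition is saved in its own file, for example under the hypothetical name `se_m.R`, a separate script can load it with `source("se_m.R")` before calling `se_m()`.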

  4. -- Read in data --

    After your hard work helping your colleague prepare for the 2018 field season by rearranging their data, you receive a revised dataset for 2016. Your colleague has asked that you make sure you can read in this format and work with it in R before tackling the rearrangement of the 2017 data.

    1. Download the cleaned 2016 WMA bird dataset and save it to the /data directory in your working directory.

    2. Read in the dataset using read.csv(), and assign it to a named object.

    3. Using the str() function, examine the structure of the imported dataset. Do the field types appear to be correct?

    4. Type the name of the object at the console and press Enter. Is the output useful?

    5. Examine the first few rows of data using the head() function.

    6. Create a new object from only the first 100 rows of data. Export this new object to a file called wma-bird-data-2016-first100.csv.
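The workflow in these steps can be sketched as follows. Because the real dataset is not reproduced here, the sketch first writes a stand-in CSV to a temporary directory; the input file name, the column names, and the values are all hypothetical and will differ from the actual WMA data. Only the output file name comes from the exercise.

```r
# Stand-in for the downloaded file (hypothetical name, columns, and values).
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
input_file <- file.path(data_dir, "wma-bird-data-2016.csv")
standin <- data.frame(
  survey_date = rep("2016-05-01", 150),
  species     = rep(c("AMRO", "NOCA", "BLJA"), 50),
  count       = rep(1:5, 30)
)
write.csv(standin, input_file, row.names = FALSE)

# Read the dataset and assign it to a named object.
birds <- read.csv(input_file, stringsAsFactors = FALSE)

# Examine the structure and the first few rows.
str(birds)
head(birds)

# Keep only the first 100 rows and export them.
first100 <- birds[1:100, ]
output_file <- file.path(data_dir, "wma-bird-data-2016-first100.csv")
write.csv(first100, output_file, row.names = FALSE)
```

Setting `row.names = FALSE` keeps R from writing its row numbers as an extra, unnamed first column in the exported file.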