Tidy Data

Improving Messy Data


MS Excel is a visual data medium–a spreadsheet. People naturally use the variety of visually oriented tools available, as well as their own intuition of how to organize data, when digitizing field notes or data forms into Excel.

This visual intuition breaks down when the data captured in the spreadsheet need to be accessed by data software for summary or analysis. Things people can gloss over without losing meaning become serious impediments to understanding for most software. Thus the challenge for the data manager inside each biologist is to think like a computer when handling data.

Imagine you have been asked to help out a colleague, a well-trained birder who has been sampling some of Minnesota’s Wildlife Management Areas. Download the bird dataset containing some of data from two years of this survey work. Note that there are multiple tabs in this spreadsheet.

  1. Find the mean and variance of the number birds observed per species for each WMA (a) in each year and (b) across years.

  2. Think about what could be improved about this data. With a partner or in a small group, identify five things about this data that cause problems when trying to summarize or analyze these data.

    What would you do to fix each of those issues?

  3. Evidently between the 2016 and 2017 field season, your birder colleague heard something about long versus wide data and switched up the data entry. Do you prefer one or the other approach? How would you combine the two?

[click here for output]