Learning Objectives
Following this assignment students should be able to:
- recognize bad data organization and why it is problematic
- implement quality control for data entry in spreadsheets
Reading
- The Data Life Cycle
- Big data and the future of ecology
- Scientists losing data at a rapid rate
- Quality Assurance and Control
Lecture Notes
Exercises
-- Improving Messy Data --
MS Excel is a visual data medium–a spreadsheet. People naturally use the variety of visually oriented tools available, as well as their own intuition of how to organize data, when digitizing field notes or data forms into Excel.
This visual intuition breaks down when the data captured in the spreadsheet need to be accessed by data software for summary or analysis. Things people can gloss over without losing meaning become serious impediments to understanding for most software. Thus the challenge for the data manager inside each biologist is to think like a computer when handling data.
Imagine you have been asked to help out a colleague, a well-trained birder who has been sampling some of Minnesota’s Wildlife Management Areas. Download the bird dataset containing some of data from two years of this survey work. Note that there are multiple tabs in this spreadsheet.
-
Find the mean and variance of the number birds observed per species for each WMA (a) in each year and (b) across years.
-
Think about what could be improved about this data. With a partner or in a small group, identify five things about this data that cause problems when trying to summarize or analyze these data.
What would you do to fix each of those issues?
-
Evidently between the 2016 and 2017 field season, your birder colleague heard something about long versus wide data and switched up the data entry. Do you prefer one or the other approach? How would you combine the two?
-
-- Data entry validation in Excel --
Refer to the bird dataset about which your colleague asked for assistance.
You decide to create a new data pipeline from scratch for the 2018 field season. Create a spreadsheet in Excel for data entry.
Be sure to use the data validation criteria to prevent some of the challenges you identified with the data earlier. The goal is to prevent obviously wrong values from being entered, and to create homogenous data columns.
Some examples of things to try:
- Year must be an integer with a value in the study period.
- Site should be one of the WMAs.
- Species should be one of the identified list.
- Dates should be standardized, and units should be specified.
Save this file as
yourname_WMA_bird_data_entry.xlsx
.