Always check your input data
Overview
Why you should always check your input data before starting
A cautionary tale
As part of creating this website I decided to revise and re-compile Part 3 of my introductory R series. I had a long train journey ahead of me (this will become relevant later…) and it seemed like an ideal time to do so. I have used the materials numerous times before, and I follow best practices such as keeping my code under version control and using Quarto for reproducible documentation, and yet one of the plots just didn’t look right.
Here I am supposed to be plotting the number of COVID cases over time for a number of countries. It probably won’t take long to spot that the line for the United Kingdom shows an extremely small number of cases - which certainly isn’t true. I was running the same code I had run countless times before, so what could the problem be? It should be noted that at no point did RStudio show any error messages, so as far as R was concerned it was behaving perfectly well.
Overview of the code
The code chunk below takes a time series dataset of COVID cases worldwide and applies some essential data cleaning transformations. These ensure that the data are in the “tidy” format expected by ggplot2 and convert the dates to the international standard format. Furthermore, some countries are represented by multiple regions, and for simplicity we add these case numbers together.
library(readr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
covid <- read_csv("raw_data/time_series_covid19_confirmed_global.csv") %>%
rename(country = `Country/Region`) %>%
pivot_longer(5:last_col(),names_to="Date", values_to="Cases") %>%
mutate(Date=as.Date(Date,"%m/%d/%y")) %>%
group_by(country,Date) %>%
summarise(Cases = sum(Cases))
Rows: 289 Columns: 1147
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Province/State, Country/Region
dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
head(covid)
# A tibble: 6 × 3
# Groups: country [1]
country Date Cases
<chr> <date> <dbl>
1 Afghanistan 2020-01-22 0
2 Afghanistan 2020-01-23 0
3 Afghanistan 2020-01-24 0
4 Afghanistan 2020-01-25 0
5 Afghanistan 2020-01-26 0
6 Afghanistan 2020-01-27 0
The plot itself is then a standard application of the ggplot function.
filter(covid, country %in% c("United Kingdom","France","Spain")) %>%
ggplot(aes(x = Date, y = Cases,col=country)) + geom_line()
But the plot is now looking as expected - with the United Kingdom showing high numbers of cases. So what could have happened to produce the plot at the top of the page? I neglected to explain that the source data come from a GitHub repository and are downloaded as part of my code. In the training materials this was intended to show a workflow that starts from data located at a remote source. The code below first creates a raw_data folder (without complaining if such a folder already exists - showWarnings = FALSE) and then checks via file.exists whether time_series_covid19_confirmed_global.csv is already present. If not, the code downloads it from GitHub.
dir.create("raw_data", showWarnings = FALSE)
if(!file.exists("raw_data/time_series_covid19_confirmed_global.csv")){
download.file("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",destfile = "raw_data/time_series_covid19_confirmed_global.csv")
}
What should happen, but was missing from the workshop materials, is to perform some basic checks on the dimensions of the data once imported into R and stored as a tibble. 🤦
covid <- read_csv("raw_data/time_series_covid19_confirmed_global.csv")
Rows: 289 Columns: 1147
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Province/State, Country/Region
dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Checking the dimensions will print the number of rows and columns
dim(covid)
[1]  289 1147
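Since the expected dimensions are known in advance, a script can go one step further and fail fast when they are wrong. Below is a minimal sketch of that idea; check_dims is a helper name I am introducing for illustration, and the expected counts would need updating as the source data grow.

```r
# Stop with an informative error if a data frame does not have the
# expected shape (check_dims is a hypothetical helper, not a built-in)
check_dims <- function(df, expected_rows, expected_cols) {
  if (nrow(df) != expected_rows || ncol(df) != expected_cols) {
    stop("Unexpected dimensions: got ", nrow(df), " x ", ncol(df),
         ", expected ", expected_rows, " x ", expected_cols)
  }
  invisible(df)
}

# e.g. check_dims(covid, 289, 1147)
```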
The head function is the classic way of printing the first six rows (adjustable using the n argument).
head(covid)
# A tibble: 6 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> Afghanistan 33.9 67.7 0 0 0
2 <NA> Albania 41.2 20.2 0 0 0
3 <NA> Algeria 28.0 1.66 0 0 0
4 <NA> Andorra 42.5 1.52 0 0 0
5 <NA> Angola -11.2 17.9 0 0 0
6 <NA> Antarctica -71.9 23.3 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
Similarly, tail will show the last rows in the data
tail(covid)
# A tibble: 6 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> Vietnam 14.1 108. 0 2 2
2 <NA> West Bank and Gaza 32.0 35.2 0 0 0
3 <NA> Winter Olympics 20… 39.9 116. 0 0 0
4 <NA> Yemen 15.6 48.5 0 0 0
5 <NA> Zambia -13.1 27.8 0 0 0
6 <NA> Zimbabwe -19.0 29.2 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
I saved a copy of the CSV file that was downloaded during the same session that created the erroneous COVID line plot.
covid_bad <- read_csv("time_series_covid19_confirmed_global_BAD.csv")
Rows: 270 Columns: 1147
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Province/State, Country/Region
dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The first rows of the tibble look to be the same
head(covid_bad)
# A tibble: 6 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> Afghanistan 33.9 67.7 0 0 0
2 <NA> Albania 41.2 20.2 0 0 0
3 <NA> Algeria 28.0 1.66 0 0 0
4 <NA> Andorra 42.5 1.52 0 0 0
5 <NA> Angola -11.2 17.9 0 0 0
6 <NA> Antarctica -71.9 23.3 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
But clearly there are fewer rows
dim(covid_bad)
[1]  270 1147
And the last rows of the tibble are not the same as the complete dataset.
tail(covid_bad)
# A tibble: 6 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Anguilla United Kingdom 18.2 -63.1 0 0 0
2 Bermuda United Kingdom 32.3 -64.8 0 0 0
3 British Virgin Is… United Kingdom 18.4 -64.6 0 0 0
4 Cayman Islands United Kingdom 19.3 -81.3 0 0 0
5 Channel Islands United Kingdom 49.4 -2.36 0 0 0
6 Falkland Islands … United Kingdom -51.8 -59.5 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
What was incredibly unlucky in my case is that not all of the United Kingdom rows are present in the shorter dataset.
filter(covid, `Country/Region` == "United Kingdom")
# A tibble: 15 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Anguilla United Kingdom 18.2 -63.1 0 0
2 Bermuda United Kingdom 32.3 -64.8 0 0
3 British Virgin Islands United Kingdom 18.4 -64.6 0 0
4 Cayman Islands United Kingdom 19.3 -81.3 0 0
5 Channel Islands United Kingdom 49.4 -2.36 0 0
6 Falkland Islands (Malvin… United Kingdom -51.8 -59.5 0 0
7 Gibraltar United Kingdom 36.1 -5.35 0 0
8 Guernsey United Kingdom 49.4 -2.59 0 0
9 Isle of Man United Kingdom 54.2 -4.55 0 0
10 Jersey United Kingdom 49.2 -2.14 0 0
11 Montserrat United Kingdom 16.7 -62.2 0 0
12 Pitcairn Islands United Kingdom -24.4 -128. 0 0
13 Saint Helena, Ascension … United Kingdom -7.95 -14.4 0 0
14 Turks and Caicos Islands United Kingdom 21.7 -71.8 0 0
15 <NA> United Kingdom 55.4 -3.44 0 0
# ℹ 1,141 more variables: `1/24/20` <dbl>, `1/25/20` <dbl>, `1/26/20` <dbl>,
# `1/27/20` <dbl>, `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
# `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>,
# `2/4/20` <dbl>, `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>,
# `2/8/20` <dbl>, `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
# `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>,
# `2/16/20` <dbl>, `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, …
Here are the rows for United Kingdom in the truncated data
filter(covid_bad, `Country/Region` == "United Kingdom")
# A tibble: 6 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Anguilla United Kingdom 18.2 -63.1 0 0 0
2 Bermuda United Kingdom 32.3 -64.8 0 0 0
3 British Virgin Is… United Kingdom 18.4 -64.6 0 0 0
4 Cayman Islands United Kingdom 19.3 -81.3 0 0 0
5 Channel Islands United Kingdom 49.4 -2.36 0 0 0
6 Falkland Islands … United Kingdom -51.8 -59.5 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
What’s worse is that the row containing the COVID cases for mainland UK is only present in the full dataset. This is where most of the cases occur.
filter(covid, `Country/Region` == "United Kingdom", is.na(`Province/State`))
# A tibble: 1 × 1,147
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <NA> United Kingdom 55.4 -3.44 0 0 0
# ℹ 1,140 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
# `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
# `2/1/20` <dbl>, `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>,
# `2/5/20` <dbl>, `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
# `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>,
# `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
# `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, …
filter(covid_bad, `Country/Region` == "United Kingdom", is.na(`Province/State`))
# A tibble: 0 × 1,147
# ℹ 1,147 variables: Province/State <chr>, Country/Region <chr>, Lat <dbl>,
# Long <dbl>, 1/22/20 <dbl>, 1/23/20 <dbl>, 1/24/20 <dbl>, 1/25/20 <dbl>,
# 1/26/20 <dbl>, 1/27/20 <dbl>, 1/28/20 <dbl>, 1/29/20 <dbl>, 1/30/20 <dbl>,
# 1/31/20 <dbl>, 2/1/20 <dbl>, 2/2/20 <dbl>, 2/3/20 <dbl>, 2/4/20 <dbl>,
# 2/5/20 <dbl>, 2/6/20 <dbl>, 2/7/20 <dbl>, 2/8/20 <dbl>, 2/9/20 <dbl>,
# 2/10/20 <dbl>, 2/11/20 <dbl>, 2/12/20 <dbl>, 2/13/20 <dbl>, 2/14/20 <dbl>,
# 2/15/20 <dbl>, 2/16/20 <dbl>, 2/17/20 <dbl>, 2/18/20 <dbl>, …
We can filter the data for the United Kingdom and use select to show the last column. The helper function last_col() is incredibly useful here, as we don’t need to know the name of the column. Using summarise we can add all the values in the column, giving the cumulative number of cases at the last date in the dataset. Dividing by one million (1e6) makes the numbers a bit easier to read.
filter(covid, `Country/Region` == "United Kingdom") %>%
select(last_col()) %>%
summarise(sum(.) / 1e6)
# A tibble: 1 × 1
`sum(.)/1e+06`
<dbl>
1 24.7
For the truncated dataset, the number of counts is substantially lower because we are missing the mainland UK data. So it is not surprising that the trend line for the UK did not look correct.
filter(covid_bad, `Country/Region` == "United Kingdom") %>%
select(last_col()) %>%
summarise(sum(.))
# A tibble: 1 × 1
`sum(.)`
<dbl>
1 63439
In this particular instance I think the poor internet connection over the train wi-fi must have caused an incomplete download of the file. Since no errors were produced, I might never have detected the problem had I not been trying to visualise data for the United Kingdom, which happens to be located towards the bottom of the file.
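Given that diagnosis, the download step itself could be made more defensive. The sketch below is one way to do it; download_checked is a name I have made up, and the default minimum-size threshold of 1.5 MB is an assumption chosen for illustration, not a value from the original materials.

```r
# Download a file only if it is not already present, and treat an
# implausibly small result as a failed download (download_checked and
# the min_bytes threshold are hypothetical, for illustration only)
download_checked <- function(url, destfile, min_bytes = 1.5e6) {
  if (file.exists(destfile)) return(invisible(destfile))
  status <- download.file(url, destfile = destfile)  # returns 0 on success
  if (status != 0 || file.size(destfile) < min_bytes) {
    # Remove the partial file so the next run tries again
    file.remove(destfile)
    stop("Download of ", destfile, " appears incomplete")
  }
  invisible(destfile)
}
```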
Checking the dimensions of the tibble helped to diagnose the problem, but there are a couple of other techniques. R has a built-in function to check the size of a file.
file.size("raw_data/time_series_covid19_confirmed_global.csv")
[1] 1820194
file.size("time_series_covid19_confirmed_global_BAD.csv")
[1] 1714255
A bit more rigorous is to calculate a checksum. In short, this is a calculated value that represents the exact contents of a file. If the file changes, even by a single byte, the checksum changes as well. That’s why it’s often described as a “digital fingerprint” for data integrity. This can be done in R by first loading the tools library and using the md5sum function.
# Load the tools package
library(tools)
# Compute MD5 checksum for a file
checksum <- md5sum("raw_data/time_series_covid19_confirmed_global.csv")
checksum
raw_data/time_series_covid19_confirmed_global.csv
"095dfec62d4981a32630bed5094296f7"
checksum_bad <- md5sum("time_series_covid19_confirmed_global_BAD.csv")
checksum_bad
time_series_covid19_confirmed_global_BAD.csv
"179af9d326d1c9451c49e5d9821c85e0"
In practice, if someone is sending you data, especially a large file, they will often also send a file containing checksums so that you can verify data integrity. The following would print TRUE if both files had exactly the same contents.
checksum == checksum_bad
raw_data/time_series_covid19_confirmed_global.csv
FALSE
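That comparison can be wrapped into a small verification step that stops the analysis if the file doesn’t match. In the sketch below, verify_md5 is a helper name I am introducing (it is not part of the tools package), and the expected value would in practice come from the data provider.

```r
library(tools)

# Compare a file's MD5 against a published value and stop on mismatch
# (verify_md5 is a hypothetical helper, not part of the tools package)
verify_md5 <- function(path, expected) {
  actual <- unname(md5sum(path))
  if (!identical(actual, expected)) {
    stop("Checksum mismatch for ", path,
         "\n  expected: ", expected,
         "\n  actual:   ", actual)
  }
  invisible(TRUE)
}
```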
Conclusion
So I should have checked the input data more thoroughly before plotting. Thankfully it was an example for a workshop, but even so I should be demonstrating best practices. I was particularly unlucky that the data stopped halfway through the particular subset I was trying to plot.
There are other ways of checking that your outputs are correct, involving something called “unit tests”. These are primarily used during software development but can also be useful in this context. I may well talk about these in the future.
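To give a flavour of the idea, a unit test for the import step might look something like the sketch below, written with the testthat package (my choice of framework; the expected values are assumptions based on the full dataset at the time of writing, and would need updating as the source grows).

```r
library(testthat)

test_that("imported covid data looks complete", {
  covid <- readr::read_csv("raw_data/time_series_covid19_confirmed_global.csv",
                           show_col_types = FALSE)
  # Expected shape of the full dataset (an assumption that will drift
  # as new dates are added to the source file)
  expect_equal(nrow(covid), 289)
  expect_equal(ncol(covid), 1147)
  # The mainland UK row (no Province/State) must be present
  expect_true(any(covid$`Country/Region` == "United Kingdom" &
                    is.na(covid$`Province/State`)))
})
```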
list.files("raw_data/")
[1] "time_series_covid19_confirmed_global.csv"