Tamara Shatar, Jul 30, 2021

The Janitor Package in R for Cleaning and Examining Data


As a trainer, my role is to show my students better ways of working. If I am successful, they leave my class with the skills to get things done faster, more accurately and with less time spent on boring, repetitive tasks. And as someone who works with data, I am always looking for tools that will make my life easier. Working with the R programming language, there are always new discoveries to be made amongst the nearly 18,000 packages created by the user community.

My latest discovery is the package janitor. It contains easy-to-use and convenient functions for cleaning and examining data. Let's take a look at some of these functions.

1. clean_names()


This function is used to change and clean up names of columns in data frames. It can be used to ensure consistency. You can choose to change all names to snake case (all lower case words, separated by underscores), variations on camel case (internal capital letters between words), title case or other styles. It can also be used to remove parts of names and any special characters, including replacing % symbols with the word percent.

To demonstrate functionality from the janitor package, a dataset was created in Excel.





The data were imported into a data frame (df) using the RStudio GUI From Excel… option.





The clean_names() function makes the names consistent and removes spaces and special characters from them.
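As a minimal sketch of this step, the example below builds a small made-up data frame in code instead of importing one from Excel. The column names and values are assumptions chosen for illustration, not the dataset used in this post.

```r
library(janitor)

# Hypothetical messy data (names and values are illustrative assumptions)
df <- data.frame(
  `Sample Date` = c("2021-07-01", "2021-07-01", "2021-07-02"),
  `Location ID` = c("A", "B", "A"),
  `% Moisture`  = c(12.5, 14.1, 11.8),
  check.names   = FALSE  # keep the spaces and symbols in the names
)

# The default style is snake case
cleaned <- clean_names(df)
names(cleaned)

# Other styles are selected with the case argument
names(clean_names(df, case = "upper_camel"))
```

Note that the % symbol is spelled out as the word percent instead of being dropped.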


2. remove_empty()

The dataset contains empty rows and empty columns that can be removed with the remove_empty() function.
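A short sketch of how this might look, assuming a hypothetical data frame in which one row and one column contain only missing values:

```r
library(janitor)

# Hypothetical data: row 2 and the notes column are entirely NA
df <- data.frame(
  date     = c("2021-07-01", NA, "2021-07-02"),
  location = c("A", NA, "B"),
  notes    = c(NA, NA, NA),
  value    = c(12.5, NA, 11.8)
)

# Drop rows and columns that contain nothing but NA
trimmed <- remove_empty(df, which = c("rows", "cols"))
dim(trimmed)
names(trimmed)
```

Passing which = "rows" or which = "cols" on its own restricts the clean-up to one direction.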


3. get_dupes()

This function retrieves any duplicates in the dataset so that they can be examined during data clean-up operations. The first argument accepts the name of the data frame, the second and subsequent arguments accept one or more column names. These columns are searched for duplicate values. The function returns a data frame which includes a dupe_count column containing the number of duplicates of that value.

We can search for duplicated measurements on certain dates, or for duplicated measurements on certain dates at certain locations.
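Both searches can be sketched as follows. The data frame and its column names (date, location, value) are hypothetical and stand in for the cleaned dataset.

```r
library(janitor)

# Hypothetical measurements: three share a date, two share a date and location
df <- data.frame(
  date     = c("2021-07-01", "2021-07-01", "2021-07-01", "2021-07-02"),
  location = c("A", "A", "B", "A"),
  value    = c(12.5, 12.9, 14.1, 11.8)
)

# Rows whose date appears more than once
dupes_by_date <- get_dupes(df, date)
dupes_by_date

# Rows whose combination of date and location appears more than once
dupes_by_date_location <- get_dupes(df, date, location)
dupes_by_date_location
```

In each result, dupe_count gives the number of rows that share the duplicated value or combination.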


4. tabyl()


This function is used to produce frequency tables and contingency tables, i.e. counts of each category or combination of categories of data. Unlike the base R table() function, tabyl() returns a data frame which makes results easier to work with.

The code below creates a data frame showing the number of rows of data (n) for each location in the dataset. Also returned is a percent column, showing the proportion of rows containing data for that location (a value between 0 and 1 until it is formatted).
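A minimal sketch of a one-way frequency table, assuming a hypothetical data frame with a location column:

```r
library(janitor)

# Hypothetical data: three rows for location A, two for location B
df <- data.frame(
  date     = c("2021-07-01", "2021-07-01", "2021-07-01", "2021-07-02", "2021-07-02"),
  location = c("A", "A", "B", "A", "B")
)

# One row per location, with a count (n) and a share of all rows (percent)
location_counts <- tabyl(df, location)
location_counts
```

Because the result is a data frame, columns such as location_counts$n can be used directly in later steps.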




5. adorn_

Janitor also provides adorn_ functions for formatting tabulated data. adorn_pct_formatting() can be used to format the percentage output.
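For illustration, the sketch below formats the percent column of a one-way table. The data frame is hypothetical, and digits = 1 is simply one possible choice.

```r
library(janitor)

# Hypothetical data: three rows for location A, two for location B
df <- data.frame(
  location = c("A", "A", "B", "A", "B")
)

# Turn the proportions into text with a % sign and one decimal place
formatted <- adorn_pct_formatting(tabyl(df, location), digits = 1)
formatted
```

After formatting, the percent column holds character strings, so any arithmetic on the proportions should be done first.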

 





We can also return the number of observations for each location on each date.
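Passing two column names to tabyl() gives a contingency table. The sketch below assumes hypothetical location and date columns.

```r
library(janitor)

# Hypothetical data: observations at two locations over two dates
df <- data.frame(
  date     = c("2021-07-01", "2021-07-01", "2021-07-01", "2021-07-02", "2021-07-02"),
  location = c("A", "A", "B", "A", "B")
)

# One row per location, one column of counts per date
by_location_date <- tabyl(df, location, date)
by_location_date
```

The first variable supplies the rows and the second supplies the columns, so swapping them transposes the table.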




By default the values in the contingency table are shown as counts. They can be changed to percentages using adorn_percentages().
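A sketch of that conversion, again on a hypothetical data frame. The choice of denominator = "row" is an assumption made for the example; "col" and "all" are also accepted.

```r
library(janitor)

# Hypothetical data: observations at two locations over two dates
df <- data.frame(
  date     = c("2021-07-01", "2021-07-01", "2021-07-01", "2021-07-02", "2021-07-02"),
  location = c("A", "A", "B", "A", "B")
)

counts <- tabyl(df, location, date)

# Replace counts with each row's proportions (every row sums to 1)
row_pct <- adorn_percentages(counts, denominator = "row")
row_pct

# Optionally format the proportions as percentages for display
adorn_pct_formatting(row_pct, digits = 1)
```

adorn_percentages() returns proportions between 0 and 1, so adorn_pct_formatting() is typically chained after it when the table is meant for reading.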





Use janitor functions with tidyverse pipes

If you use tidyverse pipes, you can use janitor functions in your pipelines to streamline data frame clean-up.
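The pipeline below is one way this could look. It is a sketch on a made-up data frame (the names and values are assumptions), and it loads dplyr only to make the %>% pipe available.

```r
library(janitor)
library(dplyr)  # provides the %>% pipe

# Hypothetical messy data: row 3 and the Notes column are entirely NA
raw <- data.frame(
  `Sample Date` = c("2021-07-01", "2021-07-01", NA, "2021-07-02"),
  `Location ID` = c("A", "B", NA, "A"),
  `% Moisture`  = c(12.5, 14.1, NA, 11.8),
  Notes         = c(NA, NA, NA, NA),
  check.names   = FALSE
)

# Clean the names, then drop the empty row and column
tidy <- raw %>%
  clean_names() %>%
  remove_empty(which = c("rows", "cols"))

# Summarise the cleaned data in the same style
location_summary <- tidy %>%
  tabyl(location_id) %>%
  adorn_pct_formatting(digits = 1)

location_summary
```

Each janitor function takes the data frame as its first argument, which is what lets the steps chain together without intermediate variables.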





Learn more about janitor at the CRAN site.

If you're new to R, check out our R training course and certifications.
