Package Design vignette for {cleanepi}

Concept and motivation

In this document, we outline the design decisions that have guided the development of the {cleanepi} R package, along with the rationale behind each decision and its potential advantages and disadvantages.

Data cleaning is an important step in ensuring the reliability of downstream analyses. The procedures involved in the cleaning process may differ depending on the data type and the research objectives. Nonetheless, certain steps can be applied universally across diverse data types, irrespective of their origin.

Design decisions

The {cleanepi} R package is designed to offer data cleaning tasks in a functional programming style. To streamline the organization of data cleaning operations, we have categorized them into distinct groups referred to as modules. These modules are based on overarching goals derived from commonly anticipated data cleaning procedures. Each module features a primary function along with additional helper functions tailored to accomplish specific tasks. Note that, except for a few cases where the outcome of a helper function can impact the cleaning task, only the main function of each module is exported. This deliberate choice empowers users to execute individual cleaning tasks as needed, enhancing flexibility and usability.

At the core of {cleanepi}, the pivotal function clean_data() serves as a wrapper encapsulating all the modules, as illustrated in the figure above. This function is intended to be the primary entry point for users seeking to clean their data. It performs the cleaning operations requested by the user through a set of explicitly defined parameters. Multiple cleaning operations can also be chained sequentially using the pipe operators (|> or %>%). In addition, the package provides two surrogate functions (a usage sketch follows the list below):

  1. scan_data(): This function enables users to assess the data types present in each column of their dataset.
  2. print_report(): By utilizing this function, users can visualize the report generated from each applied cleaning task, facilitating transparency and understanding of the data cleaning process.
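A minimal sketch of this workflow is given below. The example data file and the argument names passed to clean_data() are assumptions and should be checked against the package documentation; the functions themselves (scan_data(), clean_data(), print_report()) are those described above.

```r
library(cleanepi)

# example dataset assumed to ship with {cleanepi}
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# quick overview of the data types found in each column
scan_data(data)

# run several cleaning operations in one call; the parameter names below
# are assumptions and must match the arguments documented in ?clean_data
cleaned_data <- clean_data(
  data,
  standardize_column_names = list(keep = NULL, rename = NULL),
  remove_duplicates        = list(target_columns = NULL),
  standardize_dates        = list(target_columns = NULL)
)

# alternatively, chain the individual module functions with the pipe operator
cleaned_data <- data |>
  standardize_column_names() |>
  remove_duplicates() |>
  standardize_dates()

# display the report of the cleaning operations
print_report(cleaned_data)
```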

Scope

{cleanepi} is an R package crafted to clean, curate, and standardize tabular datasets, with a particular focus on epidemiological data. In the architecture of {cleanepi}, the data cleaning operations are categorized into modules, each providing a specific data cleaning task. The modules in the current version of {cleanepi} are described in detail in the “Modules in {cleanepi}” section below.

By compartmentalizing these operations into modules, {cleanepi} offers users a systematic and adaptable framework to address diverse data cleaning needs, especially within the realm of epidemiological datasets.

Input

The primary functions of the modules, as well as the core function clean_data(), accept input in the form of a data.frame or linelist. This gives users flexibility in deciding where to position {cleanepi} within the R package ecosystem for epidemic analysis pipelines: the data can be cleaned either before or after being converted to a linelist.

In addition to the target dataset, the core function clean_data() accepts a list of operations to be executed on the dataset. It subsequently invokes the primary functions specified for each module.
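As a sketch of the two possible orderings (the tag and column names passed to linelist::make_linelist(), as well as the clean_data() arguments, are illustrative assumptions about the example dataset):

```r
library(cleanepi)
library(linelist)

raw <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# option 1: clean first, then tag the key columns as a linelist
ll <- raw |>
  clean_data(standardize_column_names = list(keep = NULL, rename = NULL)) |>
  make_linelist(id = "study_id", date_onset = "date_first_pcr_positive_test")

# option 2: build the linelist first, then clean it; clean_data() returns
# a linelist when it receives one
ll <- raw |>
  make_linelist(id = "study_id", date_onset = "date_first_pcr_positive_test") |>
  clean_data(remove_duplicates = list(target_columns = NULL))
```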

Output

Both the primary functions of the modules and the core function clean_data() return an object of type data.frame or linelist, depending on the type of the input dataset. The report generated from all cleaning tasks is attached to this object as an attribute, which can be accessed using the attr() function in base R.
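For example (a minimal sketch; “report” is the assumed name of the attribute):

```r
library(cleanepi)

data    <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
cleaned <- remove_duplicates(data)

# retrieve the cleaning report attached to the returned object
report <- attr(cleaned, "report")
str(report, max.level = 1)
```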

Modules in {cleanepi}

In this section, we describe in detail how each module is built.

1. Standardization of column names

This module is designed to standardize the style and format of column names within the target dataset, offering users the flexibility to keep a chosen subset of column names in their original form or to rename specific columns.

By incorporating the standardize_column_names() function, {cleanepi} streamlines the process of ensuring consistency and clarity in column naming conventions, thereby enhancing the overall organization and readability of the dataset.
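A minimal sketch is shown below; the keep and rename arguments, their syntax, and the column names used are assumptions to check against ?standardize_column_names.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# convert all column names to snake case, keeping one column name unchanged
# and renaming another (assumed syntax: new_name = "old_name")
cleaned <- standardize_column_names(
  data,
  keep   = "date.of.admission",
  rename = c(subject_id = "study_id")
)
names(cleaned)
```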

2. Removal of empty rows and columns and constant columns

This module aims at eliminating irrelevant and redundant rows and columns, including empty rows and columns as well as constant columns.
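A sketch under the assumption that the module's exported function is named remove_constants() and takes a cutoff argument; both should be verified against the package reference.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# drop empty rows, empty columns, and constant columns
cleaned <- remove_constants(data, cutoff = 1)
```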

3. Detection and removal of duplicates

This module is designed to identify and eliminate duplicated rows.

Through the remove_duplicates() function, users can streamline their dataset by eliminating redundant rows, thus enhancing data integrity and analysis efficiency.
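A minimal sketch, assuming that target_columns = NULL means all columns are used when identifying duplicated rows:

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# remove rows that are duplicated across the selected columns
cleaned <- remove_duplicates(data, target_columns = NULL)

# the rows flagged as duplicates are expected to appear in the report
attr(cleaned, "report")
```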

4. Replacement of missing values with NA

This module aims to standardize and unify the representation of missing values within the dataset.

By utilizing the replace_missing_char() function, users can ensure consistency in handling missing values across their dataset, facilitating accurate analysis and interpretation of the data.
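A minimal sketch; the argument names (target_columns, na_strings) are assumptions to check against the documentation of replace_missing_char().

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# replace the codes used for missing values (e.g. "-99") with NA across
# all columns (target_columns = NULL assumed to mean "all columns")
cleaned <- replace_missing_char(
  data,
  target_columns = NULL,
  na_strings     = "-99"
)
```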

5. Standardization of date values

This module is dedicated to converting date values in character columns into Date values in ISO 8601 format, and to ensuring that all dates fall within the given timeframe.

By employing the standardize_dates() function, users can ensure uniformity and coherence in date formats across their dataset, while also validating the temporal integrity of the data within the defined timeframe.
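A minimal sketch; target_columns, timeframe, and error_tolerance are assumed argument names to verify against ?standardize_dates.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# convert character columns that contain dates into Date values (ISO 8601)
# and flag dates outside the expected timeframe
cleaned <- standardize_dates(
  data,
  target_columns  = NULL,   # assumed: detect date columns automatically
  timeframe       = as.Date(c("1950-01-01", "2024-12-31")),
  error_tolerance = 0.4
)
```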

6. Standardization of subject IDs

This module is tailored to verify whether the values in the column that uniquely identifies subjects adhere to a consistent format. It also offers functionality that allows users to correct inconsistent subject IDs.

By utilizing the functions in this module, users can ensure uniformity in the format of subject IDs, facilitating accurate tracking and analysis of individual subjects within the dataset.
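A sketch under the assumption that the module's main function is check_subject_ids() and that the example dataset has IDs of the form “PS001P2”; the function name and every argument below are assumptions based on the description above.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# verify that subject IDs follow the expected template
checked <- check_subject_ids(
  data,
  target_columns = "study_id",  # assumed name of the ID column
  prefix         = "PS",        # expected prefix
  suffix         = "P2",        # expected suffix
  range          = c(1, 100),   # expected range of the numeric part
  nchar          = 7            # expected total number of characters
)
```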

7. Dictionary-based substitution

This module facilitates dictionary-based substitution, which involves replacing existing values with predefined ones. It replaces entries in specific columns with given values, such as substituting 1 with “male” and 2 with “female” in a gender column. It also interoperates seamlessly with the get_meta_data() function from the {readepi} R package.

By leveraging the clean_using_dictionary() function, users can streamline and standardize the values within specific columns based on predefined mappings, enhancing consistency and accuracy in the dataset.

Note that the clean_using_dictionary() function will return a warning when it detects values in the target columns that are absent from the data dictionary. These unexpected values can be added to the data dictionary using the add_to_dictionary() function.
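A minimal sketch; the dictionary layout (options, values, grp, orders) follows the {matchmaker} convention, the “gender” column is assumed to exist in the dataset, and the arguments of add_to_dictionary() are assumptions.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# data dictionary mapping coded values onto labels for the "gender" column
dictionary <- data.frame(
  options = c("1", "2"),
  values  = c("male", "female"),
  grp     = "gender",
  orders  = 1:2
)

cleaned <- clean_using_dictionary(data, dictionary = dictionary)

# unexpected values reported by the previous call can be added to the
# dictionary before re-running the substitution
dictionary <- add_to_dictionary(
  dictionary, option = "3", value = "unknown", grp = "gender", order = 3
)
```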

8. Conversion of values when necessary

This module is designed to convert numbers written in words into numerical values, using the {numberize} package.

By employing the convert_to_numeric() function, users can seamlessly transform numbers written in words into numerical values, promoting accuracy in numerical analysis.

Note that convert_to_numeric() will issue a warning for unexpected values and return them in the report.
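A minimal sketch on a toy dataset; the target_columns and lang arguments are assumptions to verify against ?convert_to_numeric.

```r
library(cleanepi)

# toy dataset where some ages are written out in words
toy <- data.frame(
  study_id = c("PS001", "PS002", "PS003"),
  age      = c("twenty-five", "37", "forty")
)

# convert the spelled-out numbers into numeric values
cleaned <- convert_to_numeric(toy, target_columns = "age", lang = "en")
cleaned$age
```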

9. Verification of the sequence of date-events

This module provides functions to verify whether the sequence of date events aligns with expectations. For instance, it can flag rows where the date of admission to the hospital precedes the individual’s date of birth.

By using the check_date_sequence() function, users can systematically validate and ensure the coherence of date sequences within their dataset, promoting accuracy and reliability in subsequent analyses.
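A minimal sketch; the column names refer to the assumed example dataset, and target_columns is the assumed argument name. Dates are standardized first so that the comparison operates on Date values.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# flag rows where the events do not occur in the expected chronological order
checked <- data |>
  standardize_dates(
    target_columns = c("date_first_pcr_positive_test", "date.of.admission")
  ) |>
  check_date_sequence(
    target_columns = c("date_first_pcr_positive_test", "date.of.admission")
  )
```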

10. Transformation of selected columns

This module is dedicated to performing various specialized operations related to epidemiological data analytics; it currently includes the timespan() function.

By leveraging the timespan() function, users can efficiently compute and integrate time span information into their epidemiological dataset based on user-defined parameters, enhancing the analytics capabilities of the dataset.
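A minimal sketch; every argument name below is an assumption to check against ?timespan, and dateOfBirth is an assumed column of the example dataset.

```r
library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# compute each subject's age from the date of birth up to a reference date
cleaned <- data |>
  standardize_dates(target_columns = "dateOfBirth") |>
  timespan(
    target_column       = "dateOfBirth",
    end_date            = Sys.Date(),
    span_unit           = "years",
    span_column_name    = "age_in_years",
    span_remainder_unit = "months"
  )
```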

Surrogate functions

  1. scan_data(): This function is designed to generate a quick summary of the dataset, offering insights into the composition of each column. It calculates the percentage of values belonging to different data types such as character, numeric, missing, logical, and date. This summary can help analysts and data scientists understand the structure and content of the dataset at a glance.

  2. print_report(): This function is used to display the report detailing the results of the cleaning operations executed on the dataset. It presents information about the data cleaning processes performed, such as handling missing values, correcting data types, removing duplicates, and any other transformations applied to ensure data quality and integrity.

These surrogate functions play crucial roles in the data analysis and cleaning workflow, providing valuable information and documentation about the dataset characteristics and the steps taken to prepare it for analysis or modelling.

Dependencies

The modules and surrogate functions will depend mainly on the following packages:

  - {numberize}: used for the conversion of numbers from character to numeric.
  - {dplyr}: used in many ways, including filtering, column creation, and data summaries.
  - {magrittr}: used for its %>% operator.
  - {linelist}: used to perform some operations on linelist-type input objects.
  - {janitor}: used for the removal of constant data (empty rows and columns, as well as constant columns).
  - {matchmaker}: used to perform the dictionary-based cleaning.
  - {lubridate}: used to create, handle, and manipulate objects of type Date.
  - {reactable}: mainly used to customize the data cleaning report.
  - {arsenal}: used in standardizing column names.
  - {glue}: used in place of paste() and paste0() to avoid linter warnings.
  - {snakecase}: used in standardizing column names to transform everything into snake case, except when specified otherwise.
  - {withr}: used to handle the creation of temporary files and directories needed by print_report().
  - {readr}: used to import data.

The functions will also require other packages that are needed during the package development process, including:

{checkmate}, {kableExtra}, {bookdown}, {rmarkdown}, {testthat} (>= 3.0.0), {knitr}, {lintr}

Contribute

There are no special requirements for contributing to {cleanepi}; please follow the package contributing guide.