Validate Acceptable Values for Variables in Pandas DataFrames Using Mapping Technique

Function to Get Acceptable Values for a Variable in Pandas

As the title suggests, this post shows how to write a function that checks the columns of pandas DataFrames against lists of acceptable values and flags anything that falls outside them. We’ll use Python’s pandas library and a simple lookup-table technique for data validation.

Introduction

When working with datasets, it’s essential to ensure data consistency and quality. One way to achieve this is by validating the values of specific columns against a predefined range or list of accepted values. In this post, we’ll create a function that performs exactly this task for categorical variables in pandas DataFrames.
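As a minimal sketch of the underlying check, `Series.isin` marks which entries belong to a list of accepted values, and negating that mask selects the offenders. The column name and values below are illustrative assumptions, not data from a real dataset.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
gender = pd.Series(['male', 'female', 'Boy'], name='gender')
accepted = ['male', 'female']

# True where the value is NOT in the accepted list
mask = ~gender.isin(accepted)
bad_values = gender[mask].tolist()
print(bad_values)
```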

Why Acceptable Values Mapping?

Acceptable values mapping means keeping a static lookup table that maps each variable to its acceptable values. Because the rules live in data rather than in code, you can change the valid lists without touching the validation function.

For example, if you have a dataset with categorical variables like ‘Gender’ and ‘Marital Status’, and you want to ensure each one only contains allowed values (e.g., ‘male’ or ‘female’ for gender; ‘single’, ‘married’, or ‘divorced’ for marital status), you can create an acceptable values mapping table like this:

dataset        variable       acceptable_values
demographics   gender         male,female
demographics   marital status single,married,divorced
purchase       region         south,east,west,north
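One way to hold such a table in code is a DataFrame that is then collapsed into a dictionary keyed by (dataset, variable). The sketch below assumes the comma-separated layout shown above; the name `accepted_lookup` is hypothetical and only used for illustration.

```python
import pandas as pd

# Mapping table mirroring the example above
df_accepted = pd.DataFrame({
    'dataset': ['demographics', 'demographics', 'purchase'],
    'variable': ['gender', 'marital status', 'region'],
    'acceptable_values': ['male,female',
                          'single,married,divorced',
                          'south,east,west,north'],
})

# Collapse into {(dataset, variable): set_of_values} for quick membership tests
accepted_lookup = {
    (row.dataset, row.variable): set(row.acceptable_values.split(','))
    for row in df_accepted.itertuples(index=False)
}

print(accepted_lookup[('purchase', 'region')])
```

Storing each list as a set makes the membership test explicit, while the source table stays easy to edit by hand.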

Creating the Function

To create a function that flags values outside the acceptable lists for variables in pandas DataFrames, we’ll follow these steps:

1. Create an acceptable values mapping table.
2. Loop over all DataFrames and their columns.
3. Look up each column’s acceptable values, using the dataset and variable name as the index of the mapping table.

Here’s how you can implement this function in Python:

import numpy as np
import pandas as pd

def validate_acceptable_values(df_dict, df_accepted):
    # Index the mapping by (dataset, variable) to make lookups easy.
    # set_index returns a new object, so the caller's df_accepted is untouched.
    lookup = df_accepted.set_index(['dataset', 'variable'])['acceptable_values']

    for name, df in df_dict.items():
        for c in df:
            try:
                accepted = lookup.loc[(name, c)].split(',')
            except KeyError:
                # No rule is defined for this dataset/column pair
                print(f'Skipping validation of {c} in {name}')
                continue
            mask = ~df[c].isin(accepted)
            print(f'Bad values for {c} in {name}: {", ".join(df[c][mask].astype(str))}')
            # Replace invalid entries with NaN (modifies df in place)
            df.loc[mask, c] = np.nan

# Example usage
df_accepted = pd.DataFrame({
    'dataset': ['demographics', 'purchase'],
    'variable': ['gender', 'region'],
    'acceptable_values': ['male,female','south,east,west,north']
})

df_dict = {
    'demographics':
        pd.DataFrame({
            'id': [1, 2, 3, 4, 5],
            'gender': ['male', 'male', 'Boy', 'Other', 'missing'],
            'maritalstatus': ['single', 'single', 'single', 'married', 'divorced']
        }),
    'purchase':
        pd.DataFrame({
            'id': [1, 2, 3],
            'region': ['south', 'west', 'north-east']
        })
}

validate_acceptable_values(df_dict, df_accepted)

Output

The function prints the values that fall outside the accepted list for each dataset and variable, replaces them with NaN, and reports every column that has no entry in the mapping. For the example data above, the output is:

Skipping validation of id in demographics
Bad values for gender in demographics: Boy, Other, missing
Skipping validation of maritalstatus in demographics
Skipping validation of id in purchase
Bad values for region in purchase: north-east

This function can be used to validate categorical variables against predefined acceptable ranges or lists. It’s a useful tool when working with datasets that contain inconsistent or invalid data.
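If printing and overwriting values in place is not desirable, a variant can return the offending values instead and leave the input untouched. The following is a sketch under assumptions: the function name `find_bad_values` and the shape of the returned report are hypothetical, not part of the function above.

```python
import pandas as pd

def find_bad_values(df_dict, df_accepted):
    """Return one row per invalid entry without modifying the inputs."""
    lookup = df_accepted.set_index(['dataset', 'variable'])['acceptable_values']
    records = []
    for name, df in df_dict.items():
        for col in df.columns:
            if (name, col) not in lookup.index:
                continue  # no rule for this column, so nothing to check
            accepted = lookup.loc[(name, col)].split(',')
            bad = df.loc[~df[col].isin(accepted), col]
            for row_label, value in bad.items():
                records.append({'dataset': name, 'variable': col,
                                'row': row_label, 'value': value})
    return pd.DataFrame(records, columns=['dataset', 'variable', 'row', 'value'])

# Small illustrative inputs (assumed for this sketch)
rules = pd.DataFrame({
    'dataset': ['purchase'],
    'variable': ['region'],
    'acceptable_values': ['south,east,west,north'],
})
data = {'purchase': pd.DataFrame({'id': [1, 2, 3],
                                  'region': ['south', 'west', 'north-east']})}

report = find_bad_values(data, rules)
print(report)
```

Returning a DataFrame keeps the decision about what to do with bad values (drop, correct, or set to NaN) with the caller.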

Conclusion

In this post, we covered how to create a function that flags values outside the acceptable lists for variables in pandas DataFrames. We explored the concept of acceptable values mapping and implemented a Python function using pandas. The function can be used to validate categorical variables against predefined lists, helping to ensure data consistency and quality.

Last modified on 2023-05-24