Comparing Two DataFrames for Differences Using Pandas

Introduction to DataFrames and Comparison in Pandas

Pandas is a powerful library used for data manipulation and analysis in Python. It provides data structures such as Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure with columns of potentially different types). In this article, we will explore how to compare two DataFrames in pandas and show their differences.

Understanding the Basics of Pandas DataFrames

A DataFrame is a 2-dimensional table of data with rows and columns. Each column represents a variable, and each row represents an observation. It can be thought of as an Excel spreadsheet or a SQL table. It has several key features that make it useful for data analysis:

  • Rows: Represent individual observations.
  • Columns: Represent variables or features.
  • Index: Labels that identify each row (by default, integers starting at 0). Pandas uses these labels to align rows when two DataFrames are compared.
  • Headers: Column names.

DataFrames in Python

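To create a DataFrame, you can pass a dictionary of columns to the DataFrame constructor from the pandas library. Here is an example that creates two DataFrames which differ in two cells (the Salary in the first row and the emp_name in the last row):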

import pandas as pd

# Creating the first DataFrame
data1 = {
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [10000, 20000, 30000, 40000]
}

df1 = pd.DataFrame(data1)

# Creating the second DataFrame
data2 = {
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'eee'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [60000, 20000, 30000, 40000]
}

df2 = pd.DataFrame(data2)
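
Before looking for individual differences, it can help to confirm that the two frames are comparable at all and whether they differ in the first place. The following is a minimal sketch using the same sample data, rebuilt here so the snippet runs on its own; the variable names same_shape, same_columns and identical are illustrative only.

```python
import pandas as pd

# Same sample data as in the article
df1 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [10000, 20000, 30000, 40000],
})
df2 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'eee'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [60000, 20000, 30000, 40000],
})

# Both frames have 4 rows and 4 columns
same_shape = df1.shape == df2.shape

# Both frames use the same column names in the same order
same_columns = list(df1.columns) == list(df2.columns)

# equals() returns a single True/False for the whole frame
identical = df1.equals(df2)

print(same_shape, same_columns, identical)
```

Here equals reports that the frames are not identical, which is the cue to look for the specific rows and cells that differ.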

Comparing DataFrames in pandas

To compare two DataFrames and show their differences, you can use the following methods:

Method 1: Using isin method

When it is passed another DataFrame, the isin method compares values cell by cell, matching on both the row index and the column labels. A row is unchanged only if every one of its cells matches, so negating that check selects the rows of df2 that differ from the row with the same index label in df1:

# Find rows of df2 that differ from the same-labelled row in df1
diff_rows = df2[~df2.isin(df1).all(axis=1)]

print(diff_rows)

This will output:

   emp_id  emp_name     City  Salary
0      111        aaa    pune   60000
3      444        eee    pune   40000
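
Because isin with a DataFrame argument depends on matching row labels, it may not give the intended answer when the two frames are sorted differently or have different lengths. One common alternative, which is a different technique from the one above, is a left merge with indicator=True. The sketch below assumes the same sample data (rebuilt so it runs alone); merged and only_in_df2 are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [10000, 20000, 30000, 40000],
})
df2 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'eee'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [60000, 20000, 30000, 40000],
})

# Left-join df2 against df1 on all shared columns. indicator=True adds a
# '_merge' column that records where each row was found.
merged = df2.merge(df1, how='left', indicator=True)

# Keep rows that exist in df2 but have no exact match in df1
only_in_df2 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(only_in_df2)
```

This compares whole rows by value rather than by position, so it still works when the row order differs between the two frames.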

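Method 2: Using compare method

The diff method is sometimes suggested for this task, but it computes the difference between consecutive rows of a single DataFrame, so it cannot compare df1 with df2, and it fails on text columns such as emp_name. To see the differences between two DataFrames cell by cell, use the compare method, available since pandas 1.1. It only works when both DataFrames have identical row and column labels, which implies the same shape.

# Show only the cells that differ between df1 and df2
diff_cells = df1.compare(df2)

print(diff_cells)

The result keeps only the rows and columns that contain a difference. Each remaining column is split into self (the value in df1) and other (the value in df2), and cells that match are shown as NaN. Here it reports Salary in row 0 (10000 in df1, 60000 in df2) and emp_name in row 3 (ddd in df1, eee in df2).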
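
When two DataFrames share the same row and column labels, an element-wise boolean mask is another way to locate the cells that differ. The following is a minimal sketch under that assumption, using the same sample data (rebuilt so it runs alone); mask, changed_rows and changed_cols are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [10000, 20000, 30000, 40000],
})
df2 = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'eee'],
    'City': ['pune', 'pune', 'mumbai', 'pune'],
    'Salary': [60000, 20000, 30000, 40000],
})

# True where a cell in df1 differs from the same-labelled cell in df2
mask = df1.ne(df2)

# True for rows that contain at least one differing cell
changed_rows = mask.any(axis=1)

# For each changed row, list the names of the columns that differ
changed_cols = {
    idx: list(mask.columns[mask.loc[idx].to_numpy()])
    for idx in changed_rows[changed_rows].index
}

print(df2[changed_rows])
print(changed_cols)
```

The mask keeps the full shape of the original frames, which makes it convenient for filtering either frame or for counting differences per column with mask.sum().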

Understanding the isin method

When given another DataFrame, the isin method compares each cell with the cell that has the same row index and column label in the other DataFrame. It returns a boolean DataFrame where each element is True if the two cells match, and False otherwise.

Here’s an example:

# Create two DataFrames
data = {
    'emp_id': [1, 2, 3],
    'emp_name': ['a', 'b', 'c'],
}

df1 = pd.DataFrame(data)
df2 = pd.DataFrame({
    'emp_id': [1, 2, 4],
    'emp_name': ['a', 'b', 'd'],
})

# Use the isin method
print(df1.isin(df2))

Output:

   emp_id  emp_name
0     True      True
1     True      True
2    False     False

As you can see, the rows with index 0 and 1 (emp_id 1 and 2) match df2 in every column. The row with index 2 does not: df1 holds 3 and c where df2 holds 4 and d.
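
The label matching matters in practice. The sketch below, which uses the same small frames plus a hypothetical relabelled copy named df2_relabelled, suggests what happens when the row labels do not line up, and contrasts it with value-based membership on a single column.

```python
import pandas as pd

df1 = pd.DataFrame({'emp_id': [1, 2, 3], 'emp_name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'emp_id': [1, 2, 4], 'emp_name': ['a', 'b', 'd']})

# Same values as df2, but with row labels 10, 11, 12 instead of 0, 1, 2
df2_relabelled = df2.copy()
df2_relabelled.index = [10, 11, 12]

# No row labels match, so every cell comes back False
aligned = df1.isin(df2_relabelled)

# Series.isin with a Series checks membership by value and ignores labels
by_value = df1['emp_id'].isin(df2_relabelled['emp_id'])

print(aligned)
print(by_value.tolist())
```

If the two frames do not share an index, resetting or setting a common index before calling isin with a DataFrame, or comparing by value as in the last step, avoids these spurious mismatches.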

Conclusion

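In this article, we learned how to compare two DataFrames in pandas and show their differences. We used the isin method to flag rows whose cells differ from the cells with the same labels in another DataFrame. Note that diff computes differences between consecutive rows of a single DataFrame rather than between two DataFrames; for a cell-by-cell comparison of two identically labeled DataFrames, pandas provides the compare method.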


Last modified on 2024-05-10