
Fastest way to compare two dataframes with millions of rows on a common column in Python

2 min read · Mar 27, 2023

When comparing two large dataframes that share a common key, such as an email ID column, you can use the merge operations in pandas. However, a plain `merge` or `join` with default settings can take a long time to complete when each dataframe has millions of rows.
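As a point of reference, here is a minimal sketch of that baseline comparison. It assumes the shared column is called `email`; the helper name `compare_on_key` and the sample data are made up for illustration. It relies on the `indicator=True` option of `merge`, which adds a `_merge` column saying whether each key came from the left frame, the right frame, or both.

```python
import pandas as pd


def compare_on_key(left, right, key):
    """Classify key values as present in left only, right only, or both."""
    # An outer merge keeps every key from both frames; indicator=True adds
    # a "_merge" column with the values left_only, right_only or both.
    merged = left.merge(
        right, on=key, how="outer", indicator=True, suffixes=("_left", "_right")
    )
    return {
        label: sorted(merged.loc[merged["_merge"] == label, key].tolist())
        for label in ("left_only", "right_only", "both")
    }


# Hypothetical sample data standing in for two much larger dataframes.
old_users = pd.DataFrame(
    {"email": ["a@x.com", "b@x.com", "c@x.com"], "plan": ["free", "pro", "pro"]}
)
new_users = pd.DataFrame(
    {"email": ["b@x.com", "c@x.com", "d@x.com"], "plan": ["pro", "free", "free"]}
)

report = compare_on_key(old_users, new_users, "email")
print(report)
```

On a few rows this is instant. The cost only shows up at scale, so it is worth timing it on a realistic sample before optimizing anything.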

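To speed this up, you can borrow a technique called “join hints”. A join hint tells a query engine which join algorithm to use, based on what you know about the datasets. Note that pandas itself has no join-hint API: `merge` and `join` accept no such argument, and hints of this kind are a feature of distributed engines such as Spark SQL. In pandas, you apply the same ideas by hand, through how you prepare the data and which operation you call.

Here’s a list of the join strategies that these hints typically select: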

- `broadcast`: When joining a large dataframe to a much smaller one, send a full copy of the smaller table to every partition of the larger one, so the large table never has to move.
- `shuffle`: Redistribute both tables across the nodes by join key, so that rows with the same key end up on the same node before they are joined.
- `hash`: Build a hash table on the join key of one table, then probe it with the keys of the other to find the matching rows.
- `sorted`: Sort both tables by the join key, then walk through them in step to match rows (a sort-merge join).
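The hash and sort ideas from the list above can be approximated in plain pandas. The sketch below is illustrative only: the function names, the `email` and `plan` columns, and the sample data are assumptions, and neither function is a "hint" that pandas understands. The first uses `Series.isin` as a membership test on the key, which is enough when you only need to know which rows match. The second moves the key into a sorted index on both frames and joins on the index.

```python
import pandas as pd


def split_by_membership(left, right, key):
    """Hash-style check: which rows of `left` have a key that occurs in `right`?"""
    in_right = left[key].isin(right[key])
    return left[in_right], left[~in_right]


def join_on_sorted_index(left, right, key):
    """Sort-style join: index both frames by the key, sort, then join on the index."""
    left_indexed = left.set_index(key).sort_index()
    right_indexed = right.set_index(key).sort_index()
    return left_indexed.join(
        right_indexed, how="inner", lsuffix="_left", rsuffix="_right"
    )


# Hypothetical sample data standing in for two much larger dataframes.
left = pd.DataFrame(
    {"email": ["c@x.com", "a@x.com", "b@x.com"], "plan": ["pro", "free", "pro"]}
)
right = pd.DataFrame(
    {"email": ["b@x.com", "d@x.com", "c@x.com"], "plan": ["pro", "free", "free"]}
)

matched, missing = split_by_membership(left, right, "email")
joined = join_on_sorted_index(left, right, "email")
print(matched)
print(missing)
print(joined)
```

Whether either variant beats a plain `merge` depends on the key type, how many keys repeat, and how much memory is available, so time both on a sample of your own data.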

Here’s an example of how you can apply these strategies to perform a more efficient join operation on two dataframes with millions of rows:

```
import pandas as pd
```

Written by Milind Soorya
