4. Joining Multiple Pandas Dataframes

In this unit we will discuss strategies for dealing with multiple dataframes and combing them into a single dataframe.

Concatenation

Concatenation appends the rows of one or more dataframes together. This is a row-oriented operation so the resulting dataframe will be longer. For example, if a 50-row dataframe is concatenated with a 40-row dataframe, you will have a 90-row dataframe.

The pd.concat() function is used to concatenate frames. It takes several arguments, but the most often-used ones are: pd.concat(items: list[pd.DataFrame], ignore_index=False). The first argument is a list of Dataframes to concat. The ignore_index keyword argument governs what happens to the indices in the combined frame: we can choose to keep the current index in each dataframe ignore_index=False or create a new index ignore_index=True.

An example:

import pandas as pd
campus_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/campus-students.csv")
campus_students
Name Grade Year
0 Helen NaN Sophomore
1 Iris 10.0 Senior
2 Jimmy 8.0 Freshman
3 Karen NaN Freshman
4 Lynne 10.0 Sophomore
5 Mike 10.0 Sophomore
6 Nico NaN Junior
7 Pete 8.0 Freshman
online_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/online-students.csv")
online_students
Name Grade Year Location
0 Abby 7.0 Freshman NY
1 Bob 9.0 Sophomore CA
2 Chris 10.0 Senior CA
3 Dave 8.0 Freshman NY
4 Ellen 7.0 Sophomore TX
5 Fran 10.0 Senior FL
6 Greg 8.0 Freshman NY
combined_students = pd.concat([campus_students, online_students])
combined_students
Name Grade Year Location
0 Helen NaN Sophomore NaN
1 Iris 10.0 Senior NaN
2 Jimmy 8.0 Freshman NaN
3 Karen NaN Freshman NaN
4 Lynne 10.0 Sophomore NaN
5 Mike 10.0 Sophomore NaN
6 Nico NaN Junior NaN
7 Pete 8.0 Freshman NaN
0 Abby 7.0 Freshman NY
1 Bob 9.0 Sophomore CA
2 Chris 10.0 Senior CA
3 Dave 8.0 Freshman NY
4 Ellen 7.0 Sophomore TX
5 Fran 10.0 Senior FL
6 Greg 8.0 Freshman NY

concat() - Ignoring the index

As you can see from the code above the index from the original DataFrames was used. For example Helen and Abby both share the index 0. This means that if you provide index 0, you’ll get both of them, e.g.:

combined_students['Name'][0]
0    Helen
0     Abby
Name: Name, dtype: str

While this is acceptable, there are situations where a new index based on combined values is desirable. We will encounter this later when grouping or pivioting data.

To make this happen include the ignore_index=True named argument. This will create a new index from the output DataFrame.

students = pd.concat([campus_students, online_students], ignore_index=True)
students
Name Grade Year Location
0 Helen NaN Sophomore NaN
1 Iris 10.0 Senior NaN
2 Jimmy 8.0 Freshman NaN
3 Karen NaN Freshman NaN
4 Lynne 10.0 Sophomore NaN
5 Mike 10.0 Sophomore NaN
6 Nico NaN Junior NaN
7 Pete 8.0 Freshman NaN
8 Abby 7.0 Freshman NY
9 Bob 9.0 Sophomore CA
10 Chris 10.0 Senior CA
11 Dave 8.0 Freshman NY
12 Ellen 7.0 Sophomore TX
13 Fran 10.0 Senior FL
14 Greg 8.0 Freshman NY

Best practice - data lineage

When combining datasets, its a really good idea to retain data lineage, or a record of where the data came from. This can be done by added a column to each dataframe before concatenating.

In this example we create a student "type" column to track lineage.

campus_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/campus-students.csv")
campus_students['type'] = 'campus'
online_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/online-students.csv")
online_students['type'] = 'online'
students = pd.concat([campus_students, online_students], ignore_index=True)
students
Name Grade Year type Location
0 Helen NaN Sophomore campus NaN
1 Iris 10.0 Senior campus NaN
2 Jimmy 8.0 Freshman campus NaN
3 Karen NaN Freshman campus NaN
4 Lynne 10.0 Sophomore campus NaN
5 Mike 10.0 Sophomore campus NaN
6 Nico NaN Junior campus NaN
7 Pete 8.0 Freshman campus NaN
8 Abby 7.0 Freshman online NY
9 Bob 9.0 Sophomore online CA
10 Chris 10.0 Senior online CA
11 Dave 8.0 Freshman online NY
12 Ellen 7.0 Sophomore online TX
13 Fran 10.0 Senior online FL
14 Greg 8.0 Freshman online NY
CautionCode Challenge 4.4.1: classic use case for concatenation

Consider the JSON file here: https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json You can find a copy of this file in your class-notebook repository, in data-wrangling/data/json-examples/employees-dict.json.

Let’s take a look at the structure of this file in a dictionary format:

import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json")
employees = response.json()
employees
{'accounting': [{'firstName': 'John', 'lastName': 'Doe', 'age': 23},
  {'firstName': 'Mary', 'lastName': 'Smith', 'age': 32}],
 'sales': [{'firstName': 'Sally', 'lastName': 'Green', 'age': 27},
  {'firstName': 'Jim', 'lastName': 'Galley', 'age': 41}],
 'marketing': [{'firstName': 'Tom', 'lastName': 'Brown', 'age': 28}]}

If you are working in your class-notebook repository, you can load the file using:

import json
# note that we are assuming your current working directory (i.e., where you are
# in your file system when running your notebook) is data-wrangling. If your
# are in your class-notebook directory, then change the file path to
# data-wrangling/data/json-examples/exmployees-dict.json
with open('data/json-examples/employees-dict.json', 'r') as fp:
    employees = json.load(fp)
employees

The issue with the JSON data is that there are employees under keys by department "accounting", "sales", "marketing":

print("departments", employees.keys())
departments dict_keys(['accounting', 'sales', 'marketing'])

This is the classic use-case for pd.concat() as there is no practical way to use pd.json_normalize() to get all the employees under each department.

Challenge: for each department:

  1. Create a dataframe for that department.

  2. Add lineage to the dataframe (i.e., add the department name).

  3. Add the dataframe to a list of departments.

  4. Use pd.concat on the list of departments to create one dataframe. Print the dataframe. The output should look like:

    firstName lastName  age        dept
    0      John      Doe   23  accounting
    1      Mary    Smith   32  accounting
    2     Sally    Green   27       sales
    3       Jim   Galley   41       sales
    4       Tom    Brown   28   marketing
import pandas as pd
import requests

import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json")
employees = response.json()

departments = []
for dept_name in employees.keys():
    # convert the department dictionary into a DataFrame
    dept_employees = pd.DataFrame(employees[dept_name])
    # Note: this also would work:
    # dept_employees = pd.json_normalize(employees,  record_path=dept_name)
    # Add the department name to the columns (data lineage):
    dept_employees['dept'] = dept_name
    departments.append(dept_employees)

combined = pd.concat(departments, ignore_index=True)
combined
firstName lastName age dept
0 John Doe 23 accounting
1 Mary Smith 32 accounting
2 Sally Green 27 sales
3 Jim Galley 41 sales
4 Tom Brown 28 marketing

De-duplication

Sometimes after a pd.concat() you will have duplicate rows.

You can use df.drop_duplicates() to remove repeated rows of data.

Without an argument, this will scan the entire row of data to determine if the row is the same.

If your data has a natural key, you can specify that with the subset= named argument. This will improve performance. An example:

o1 = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dedupe/orders1.csv")
o2 = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dedupe/orders2.csv")
orders = pd.concat([o1, o2], ignore_index=True)
orders.sort_values('orderid')
orderid orderdate custname custemail custcountry orderstatus ordertotal ordercreditcard ordershipvia shippingtotal
0 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
10 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
11 3 2020-02-23 Loy Siberry lsiberry1@so-net.ne.jp Canada delivered 76.87 Discover USPS 6.27
1 4 2022-04-28 Carree Henworth NaN Canada pending 152.30 Discover USPS 12.74
12 4 2022-04-28 Carree Henworth NaN Canada pending 152.30 Discover USPS 12.74
2 5 2019-11-22 Goldina Godsafe ggodsafe3@dailymail.co.uk United States shipped 182.17 Amex UPS 5.44
13 6 2022-05-03 Marris Chatten mchatten4@csmonitor.com Mexico pending 208.28 Discover RPS 2.16
3 6 2022-05-03 Marris Chatten mchatten4@csmonitor.com Mexico pending 208.28 Discover RPS 2.16
14 7 2022-12-19 Logan Jacobsson ljacobsson5@wufoo.com United States delivered 112.15 Amex USPS 11.52
4 7 2022-12-19 Logan Jacobsson ljacobsson5@wufoo.com United States delivered 112.15 Amex USPS 11.52
15 8 2019-06-05 Lilli Feares lfeares6@shop-pro.jp Mexico pending 237.90 Discover FedEX 4.48
16 9 2019-02-17 Lowrance Sigsworth lsigsworth7@youtube.com United States delivered 141.94 Discover USPS 7.31
17 10 2023-04-19 Libbi Spadari lspadari8@dot.gov Mexico pending 160.79 Discover RPS 16.52
5 10 2023-04-19 Libbi Spadari lspadari8@dot.gov Mexico pending 160.79 Discover RPS 16.52
6 11 2020-01-20 Renato Hue rhue9@un.org Canada delivered 120.52 Visa USPS 5.57
7 12 2022-03-03 Lucky Helstrip lhelstripa@tmall.com Mexico delivered 202.07 Amex UPS 18.57
18 12 2022-03-03 Lucky Helstrip lhelstripa@tmall.com Mexico delivered 202.07 Amex UPS 18.57
8 13 2021-09-04 Debi Myrie dmyrieb@unc.edu United States delivered 131.62 Amex UPS 2.37
19 13 2021-09-04 Debi Myrie dmyrieb@unc.edu United States delivered 131.62 Amex UPS 2.37
20 14 2022-02-27 Hyacinth Aveyard haveyardc@ucoz.com United States pending 209.86 Amex USPS 8.69
9 15 2019-01-11 Crin Blanket cblanketd@newsvine.com United States delivered 85.46 Visa UPS 14.22

Here, we use the entire row to check for duplicates:

orders.drop_duplicates().sort_values("orderid")
orderid orderdate custname custemail custcountry orderstatus ordertotal ordercreditcard ordershipvia shippingtotal
0 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
11 3 2020-02-23 Loy Siberry lsiberry1@so-net.ne.jp Canada delivered 76.87 Discover USPS 6.27
1 4 2022-04-28 Carree Henworth NaN Canada pending 152.30 Discover USPS 12.74
2 5 2019-11-22 Goldina Godsafe ggodsafe3@dailymail.co.uk United States shipped 182.17 Amex UPS 5.44
3 6 2022-05-03 Marris Chatten mchatten4@csmonitor.com Mexico pending 208.28 Discover RPS 2.16
4 7 2022-12-19 Logan Jacobsson ljacobsson5@wufoo.com United States delivered 112.15 Amex USPS 11.52
15 8 2019-06-05 Lilli Feares lfeares6@shop-pro.jp Mexico pending 237.90 Discover FedEX 4.48
16 9 2019-02-17 Lowrance Sigsworth lsigsworth7@youtube.com United States delivered 141.94 Discover USPS 7.31
5 10 2023-04-19 Libbi Spadari lspadari8@dot.gov Mexico pending 160.79 Discover RPS 16.52
6 11 2020-01-20 Renato Hue rhue9@un.org Canada delivered 120.52 Visa USPS 5.57
7 12 2022-03-03 Lucky Helstrip lhelstripa@tmall.com Mexico delivered 202.07 Amex UPS 18.57
8 13 2021-09-04 Debi Myrie dmyrieb@unc.edu United States delivered 131.62 Amex UPS 2.37
20 14 2022-02-27 Hyacinth Aveyard haveyardc@ucoz.com United States pending 209.86 Amex USPS 8.69
9 15 2019-01-11 Crin Blanket cblanketd@newsvine.com United States delivered 85.46 Visa UPS 14.22

Here, we just use the orderid to check for duplicates:

orders.drop_duplicates(subset="orderid").sort_values("orderid")
orderid orderdate custname custemail custcountry orderstatus ordertotal ordercreditcard ordershipvia shippingtotal
0 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
11 3 2020-02-23 Loy Siberry lsiberry1@so-net.ne.jp Canada delivered 76.87 Discover USPS 6.27
1 4 2022-04-28 Carree Henworth NaN Canada pending 152.30 Discover USPS 12.74
2 5 2019-11-22 Goldina Godsafe ggodsafe3@dailymail.co.uk United States shipped 182.17 Amex UPS 5.44
3 6 2022-05-03 Marris Chatten mchatten4@csmonitor.com Mexico pending 208.28 Discover RPS 2.16
4 7 2022-12-19 Logan Jacobsson ljacobsson5@wufoo.com United States delivered 112.15 Amex USPS 11.52
15 8 2019-06-05 Lilli Feares lfeares6@shop-pro.jp Mexico pending 237.90 Discover FedEX 4.48
16 9 2019-02-17 Lowrance Sigsworth lsigsworth7@youtube.com United States delivered 141.94 Discover USPS 7.31
5 10 2023-04-19 Libbi Spadari lspadari8@dot.gov Mexico pending 160.79 Discover RPS 16.52
6 11 2020-01-20 Renato Hue rhue9@un.org Canada delivered 120.52 Visa USPS 5.57
7 12 2022-03-03 Lucky Helstrip lhelstripa@tmall.com Mexico delivered 202.07 Amex UPS 18.57
8 13 2021-09-04 Debi Myrie dmyrieb@unc.edu United States delivered 131.62 Amex UPS 2.37
20 14 2022-02-27 Hyacinth Aveyard haveyardc@ucoz.com United States pending 209.86 Amex USPS 8.69
9 15 2019-01-11 Crin Blanket cblanketd@newsvine.com United States delivered 85.46 Visa UPS 14.22

That gives the same result as using the whole row because the orderid is one-to-one with the uniqueness of the row. It’s faster doing this though, even on this small dataset (use %timeit to compare)!

Notice if we had used a different subset which is not representative of the row, we lose data. For example:

orders.drop_duplicates(subset="ordercreditcard").sort_values("orderid")
orderid orderdate custname custemail custcountry orderstatus ordertotal ordercreditcard ordershipvia shippingtotal
0 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
2 5 2019-11-22 Goldina Godsafe ggodsafe3@dailymail.co.uk United States shipped 182.17 Amex UPS 5.44
6 11 2020-01-20 Renato Hue rhue9@un.org Canada delivered 120.52 Visa USPS 5.57

Returning a dataframe of duplicates

To get a dataframe of just the duplicated values you can use df.duplicated(). This returns a boolean series that you can use to extract the duplicated rows from the concatenated dataframe. Example:

dupes = orders.duplicated(subset=['orderid'])
orders[dupes]
orderid orderdate custname custemail custcountry orderstatus ordertotal ordercreditcard ordershipvia shippingtotal
10 2 2023-03-24 Frayda Pepperd fpepperd0@sciencedaily.com Canada delivered 228.39 Discover RPS 12.05
12 4 2022-04-28 Carree Henworth NaN Canada pending 152.30 Discover USPS 12.74
13 6 2022-05-03 Marris Chatten mchatten4@csmonitor.com Mexico pending 208.28 Discover RPS 2.16
14 7 2022-12-19 Logan Jacobsson ljacobsson5@wufoo.com United States delivered 112.15 Amex USPS 11.52
17 10 2023-04-19 Libbi Spadari lspadari8@dot.gov Mexico pending 160.79 Discover RPS 16.52
18 12 2022-03-03 Lucky Helstrip lhelstripa@tmall.com Mexico delivered 202.07 Amex UPS 18.57
19 13 2021-09-04 Debi Myrie dmyrieb@unc.edu United States delivered 131.62 Amex UPS 2.37

Merges

A merge combines two dataframes based on a common column. The resulting dataframe is wider (has more columns) than the original dataframe. The function to do this is pd.merge(). It’s most common arguments:

pd.merge(left: pd.DataFrame, right:pd.Dataframe, how:str, left_on:str, right_on:str)

  • how specifies the join operation:
    • "inner" - returns ONLY rows when values of left_on match right_on

    • "left" - returns ALL rows from left and ONLY rows from right when values of left_on match right_on

    • "right" - returns ALL rows from right and ONLY rows from left when values of left_on match right_on

    • "outer" - returns ALL rows from left and right and rows when values of left_on match right_on

To illustrate the differences between these, consider merging the following two dataframes, one representing a roster of basketball players, and another a list of teams:

bbplayers = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/bbplayers.csv")
bbplayers
player_id player_name career_pts player_team_id
0 101 Jordan 32292 1.0
1 102 Pippen 18940 1.0
2 103 Bryant 33643 2.0
3 104 O'Neal 28596 2.0
4 105 Fudge 0 NaN
bbteams = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/bbteams.csv")
bbteams
team_id team_name team_location
0 1 Bulls Chicago, IL
1 2 Lakers Los Angeles, CA
2 3 Tropics Flint, MI

Inner join: Only rows that match the bbplayer.player_team_id and bbteam.team_id and bbteam will be included. Note that in this case we lose Player 105 and team 3 because there are no matches:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='inner')
combined
player_id player_name career_pts player_team_id team_id team_name team_location
0 101 Jordan 32292 1.0 1 Bulls Chicago, IL
1 102 Pippen 18940 1.0 1 Bulls Chicago, IL
2 103 Bryant 33643 2.0 2 Lakers Los Angeles, CA
3 104 O'Neal 28596 2.0 2 Lakers Los Angeles, CA

Left join: All rows from the left (every player) and ONLY rows from team that match. Now we see Player 105 despite no team match:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='left')
combined
player_id player_name career_pts player_team_id team_id team_name team_location
0 101 Jordan 32292 1.0 1.0 Bulls Chicago, IL
1 102 Pippen 18940 1.0 1.0 Bulls Chicago, IL
2 103 Bryant 33643 2.0 2.0 Lakers Los Angeles, CA
3 104 O'Neal 28596 2.0 2.0 Lakers Los Angeles, CA
4 105 Fudge 0 NaN NaN NaN NaN

Right join: All rows from the right (every team) and ONLY rows from players that match. We now see Team 3 despite no player match:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='right')
combined
player_id player_name career_pts player_team_id team_id team_name team_location
0 101.0 Jordan 32292.0 1.0 1 Bulls Chicago, IL
1 102.0 Pippen 18940.0 1.0 1 Bulls Chicago, IL
2 103.0 Bryant 33643.0 2.0 2 Lakers Los Angeles, CA
3 104.0 O'Neal 28596.0 2.0 2 Lakers Los Angeles, CA
4 NaN NaN NaN NaN 3 Tropics Flint, MI

Outer join: All rows from both tables are included. This is equivalent to doing an inner join + all non-matching rows from both tables. In this case, we see Team 3 and Player 105:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='outer')
combined
player_id player_name career_pts player_team_id team_id team_name team_location
0 101.0 Jordan 32292.0 1.0 1.0 Bulls Chicago, IL
1 102.0 Pippen 18940.0 1.0 1.0 Bulls Chicago, IL
2 103.0 Bryant 33643.0 2.0 2.0 Lakers Los Angeles, CA
3 104.0 O'Neal 28596.0 2.0 2.0 Lakers Los Angeles, CA
4 NaN NaN NaN NaN 3.0 Tropics Flint, MI
5 105.0 Fudge 0.0 NaN NaN NaN NaN
CautionCode Challenge 4.4.2: who is not buying from MiniMart?

Consider the following data from a grocery store: https://github.com/mafudge/datasets/tree/master/minimart. In that directory you’ll see customers.csv, which is a list of customers, and a separate CSV file of purchases made in the first four months of the year.

You have been hired to build a UI to display names of customers who did not buy from minimart in any given month. Write a Streamlit app that displays a dataframe of customers who did not buy anything in a given month. The app should have a drop down menu that allows the user to select the month to display.

The URL for the location of raw data that you can use in your app is:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/minimart/

import streamlit as st
import pandas as pd

base = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/minimart/"
months = ['jan', 'feb', 'mar', 'apr']

st.title("Who's not Buying from MiniMart?")
month = st.selectbox('Select Month:', months)

purchases = pd.read_csv(f"{base}/purchases-{month}.csv")
customers = pd.read_csv(f"{base}/customers.csv")
combined = pd.merge(customers, purchases, left_on='customer_id', right_on='customer_id', how='left')
cols = ["customer_id", "firstname", "lastname"]
did_not_buy = combined["order_id"].isnull()
customers_who_did_not_buy = combined[did_not_buy][cols]
st.header(f"These people did not buy anything in {month.capitalize()}.:")
st.dataframe(customers_who_did_not_buy, hide_index=True)

# You can add the following to double check the results:
#st.divider()
#st.write("debug")
#st.dataframe(combined)
# That will display the full combined frame for the selected month. The ones
# without entries for the order id should be what's displayed in the table above.