4. Joining Multiple Pandas Dataframes

In this unit we will discuss strategies for dealing with multiple dataframes and combing them into a single dataframe.

Concatenation

Concatenation appends the rows of one or more dataframes together. This is a row-oriented operation so the resulting dataframe will be longer. For example, if a 50-row dataframe is concatenated with a 40-row dataframe, you will have a 90-row dataframe.

The pd.concat() function is used to concatenate frames. It takes several arguments, but the most often-used ones are: pd.concat(items: list[pd.DataFrame], ignore_index=False). The first argument is a list of Dataframes to concat. The ignore_index keyword argument governs what happens to the indices in the combined frame: we can choose to keep the current index in each dataframe ignore_index=False or create a new index ignore_index=True.

An example:

import pandas as pd
campus_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/campus-students.csv")
campus_students

	Name	Grade	Year
0	Helen	NaN	Sophomore
1	Iris	10.0	Senior
2	Jimmy	8.0	Freshman
3	Karen	NaN	Freshman
4	Lynne	10.0	Sophomore
5	Mike	10.0	Sophomore
6	Nico	NaN	Junior
7	Pete	8.0	Freshman

online_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/online-students.csv")
online_students

	Name	Grade	Year	Location
0	Abby	7.0	Freshman	NY
1	Bob	9.0	Sophomore	CA
2	Chris	10.0	Senior	CA
3	Dave	8.0	Freshman	NY
4	Ellen	7.0	Sophomore	TX
5	Fran	10.0	Senior	FL
6	Greg	8.0	Freshman	NY

combined_students = pd.concat([campus_students, online_students])
combined_students

	Name	Grade	Year	Location
0	Helen	NaN	Sophomore	NaN
1	Iris	10.0	Senior	NaN
2	Jimmy	8.0	Freshman	NaN
3	Karen	NaN	Freshman	NaN
4	Lynne	10.0	Sophomore	NaN
5	Mike	10.0	Sophomore	NaN
6	Nico	NaN	Junior	NaN
7	Pete	8.0	Freshman	NaN
0	Abby	7.0	Freshman	NY
1	Bob	9.0	Sophomore	CA
2	Chris	10.0	Senior	CA
3	Dave	8.0	Freshman	NY
4	Ellen	7.0	Sophomore	TX
5	Fran	10.0	Senior	FL
6	Greg	8.0	Freshman	NY

concat() - Ignoring the index

As you can see from the code above the index from the original DataFrames was used. For example Helen and Abby both share the index 0. This means that if you provide index 0, you’ll get both of them, e.g.:

combined_students['Name'][0]

0    Helen
0     Abby
Name: Name, dtype: str

While this is acceptable, there are situations where a new index based on combined values is desirable. We will encounter this later when grouping or pivioting data.

To make this happen include the ignore_index=True named argument. This will create a new index from the output DataFrame.

students = pd.concat([campus_students, online_students], ignore_index=True)
students

	Name	Grade	Year	Location
0	Helen	NaN	Sophomore	NaN
1	Iris	10.0	Senior	NaN
2	Jimmy	8.0	Freshman	NaN
3	Karen	NaN	Freshman	NaN
4	Lynne	10.0	Sophomore	NaN
5	Mike	10.0	Sophomore	NaN
6	Nico	NaN	Junior	NaN
7	Pete	8.0	Freshman	NaN
8	Abby	7.0	Freshman	NY
9	Bob	9.0	Sophomore	CA
10	Chris	10.0	Senior	CA
11	Dave	8.0	Freshman	NY
12	Ellen	7.0	Sophomore	TX
13	Fran	10.0	Senior	FL
14	Greg	8.0	Freshman	NY

Best practice - data lineage

When combining datasets, its a really good idea to retain data lineage, or a record of where the data came from. This can be done by added a column to each dataframe before concatenating.

In this example we create a student "type" column to track lineage.

campus_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/campus-students.csv")
campus_students['type'] = 'campus'
online_students = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/delimited/online-students.csv")
online_students['type'] = 'online'
students = pd.concat([campus_students, online_students], ignore_index=True)
students

	Name	Grade	Year	type	Location
0	Helen	NaN	Sophomore	campus	NaN
1	Iris	10.0	Senior	campus	NaN
2	Jimmy	8.0	Freshman	campus	NaN
3	Karen	NaN	Freshman	campus	NaN
4	Lynne	10.0	Sophomore	campus	NaN
5	Mike	10.0	Sophomore	campus	NaN
6	Nico	NaN	Junior	campus	NaN
7	Pete	8.0	Freshman	campus	NaN
8	Abby	7.0	Freshman	online	NY
9	Bob	9.0	Sophomore	online	CA
10	Chris	10.0	Senior	online	CA
11	Dave	8.0	Freshman	online	NY
12	Ellen	7.0	Sophomore	online	TX
13	Fran	10.0	Senior	online	FL
14	Greg	8.0	Freshman	online	NY

Code Challenge 4.4.1: classic use case for concatenation

Consider the JSON file here: https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json You can find a copy of this file in your class-notebook repository, in data-wrangling/data/json-examples/employees-dict.json.

Let’s take a look at the structure of this file in a dictionary format:

import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json")
employees = response.json()
employees

{'accounting': [{'firstName': 'John', 'lastName': 'Doe', 'age': 23},
  {'firstName': 'Mary', 'lastName': 'Smith', 'age': 32}],
 'sales': [{'firstName': 'Sally', 'lastName': 'Green', 'age': 27},
  {'firstName': 'Jim', 'lastName': 'Galley', 'age': 41}],
 'marketing': [{'firstName': 'Tom', 'lastName': 'Brown', 'age': 28}]}

If you are working in your class-notebook repository, you can load the file using:

import json
# note that we are assuming your current working directory (i.e., where you are
# in your file system when running your notebook) is data-wrangling. If your
# are in your class-notebook directory, then change the file path to
# data-wrangling/data/json-examples/exmployees-dict.json
with open('data/json-examples/employees-dict.json', 'r') as fp:
    employees = json.load(fp)
employees

The issue with the JSON data is that there are employees under keys by department "accounting", "sales", "marketing":

print("departments", employees.keys())

departments dict_keys(['accounting', 'sales', 'marketing'])

This is the classic use-case for pd.concat() as there is no practical way to use pd.json_normalize() to get all the employees under each department.

Challenge: for each department:

Create a dataframe for that department.
Add lineage to the dataframe (i.e., add the department name).
Add the dataframe to a list of departments.
Use pd.concat on the list of departments to create one dataframe. Print the dataframe. The output should look like:

    firstName lastName  age        dept
    0      John      Doe   23  accounting
    1      Mary    Smith   32  accounting
    2     Sally    Green   27       sales
    3       Jim   Galley   41       sales
    4       Tom    Brown   28   marketing

Solution

import pandas as pd
import requests

import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees-dict.json")
employees = response.json()

departments = []
for dept_name in employees.keys():
    # convert the department dictionary into a DataFrame
    dept_employees = pd.DataFrame(employees[dept_name])
    # Note: this also would work:
    # dept_employees = pd.json_normalize(employees,  record_path=dept_name)
    # Add the department name to the columns (data lineage):
    dept_employees['dept'] = dept_name
    departments.append(dept_employees)

combined = pd.concat(departments, ignore_index=True)
combined

	firstName	lastName	age	dept
0	John	Doe	23	accounting
1	Mary	Smith	32	accounting
2	Sally	Green	27	sales
3	Jim	Galley	41	sales
4	Tom	Brown	28	marketing

De-duplication

Sometimes after a pd.concat() you will have duplicate rows.

You can use df.drop_duplicates() to remove repeated rows of data.

Without an argument, this will scan the entire row of data to determine if the row is the same.

If your data has a natural key, you can specify that with the subset= named argument. This will improve performance. An example:

o1 = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dedupe/orders1.csv")
o2 = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dedupe/orders2.csv")
orders = pd.concat([o1, o2], ignore_index=True)
orders.sort_values('orderid')

	orderid	orderdate	custname	custemail	custcountry	orderstatus	ordertotal	ordercreditcard	ordershipvia	shippingtotal
0	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
10	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
11	3	2020-02-23	Loy Siberry	lsiberry1@so-net.ne.jp	Canada	delivered	76.87	Discover	USPS	6.27
1	4	2022-04-28	Carree Henworth	NaN	Canada	pending	152.30	Discover	USPS	12.74
12	4	2022-04-28	Carree Henworth	NaN	Canada	pending	152.30	Discover	USPS	12.74
2	5	2019-11-22	Goldina Godsafe	ggodsafe3@dailymail.co.uk	United States	shipped	182.17	Amex	UPS	5.44
13	6	2022-05-03	Marris Chatten	mchatten4@csmonitor.com	Mexico	pending	208.28	Discover	RPS	2.16
3	6	2022-05-03	Marris Chatten	mchatten4@csmonitor.com	Mexico	pending	208.28	Discover	RPS	2.16
14	7	2022-12-19	Logan Jacobsson	ljacobsson5@wufoo.com	United States	delivered	112.15	Amex	USPS	11.52
4	7	2022-12-19	Logan Jacobsson	ljacobsson5@wufoo.com	United States	delivered	112.15	Amex	USPS	11.52
15	8	2019-06-05	Lilli Feares	lfeares6@shop-pro.jp	Mexico	pending	237.90	Discover	FedEX	4.48
16	9	2019-02-17	Lowrance Sigsworth	lsigsworth7@youtube.com	United States	delivered	141.94	Discover	USPS	7.31
17	10	2023-04-19	Libbi Spadari	lspadari8@dot.gov	Mexico	pending	160.79	Discover	RPS	16.52
5	10	2023-04-19	Libbi Spadari	lspadari8@dot.gov	Mexico	pending	160.79	Discover	RPS	16.52
6	11	2020-01-20	Renato Hue	rhue9@un.org	Canada	delivered	120.52	Visa	USPS	5.57
7	12	2022-03-03	Lucky Helstrip	lhelstripa@tmall.com	Mexico	delivered	202.07	Amex	UPS	18.57
18	12	2022-03-03	Lucky Helstrip	lhelstripa@tmall.com	Mexico	delivered	202.07	Amex	UPS	18.57
8	13	2021-09-04	Debi Myrie	dmyrieb@unc.edu	United States	delivered	131.62	Amex	UPS	2.37
19	13	2021-09-04	Debi Myrie	dmyrieb@unc.edu	United States	delivered	131.62	Amex	UPS	2.37
20	14	2022-02-27	Hyacinth Aveyard	haveyardc@ucoz.com	United States	pending	209.86	Amex	USPS	8.69
9	15	2019-01-11	Crin Blanket	cblanketd@newsvine.com	United States	delivered	85.46	Visa	UPS	14.22

Here, we use the entire row to check for duplicates:

orders.drop_duplicates().sort_values("orderid")

	orderid	orderdate	custname	custemail	custcountry	orderstatus	ordertotal	ordercreditcard	ordershipvia	shippingtotal
0	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
11	3	2020-02-23	Loy Siberry	lsiberry1@so-net.ne.jp	Canada	delivered	76.87	Discover	USPS	6.27
1	4	2022-04-28	Carree Henworth	NaN	Canada	pending	152.30	Discover	USPS	12.74
2	5	2019-11-22	Goldina Godsafe	ggodsafe3@dailymail.co.uk	United States	shipped	182.17	Amex	UPS	5.44
3	6	2022-05-03	Marris Chatten	mchatten4@csmonitor.com	Mexico	pending	208.28	Discover	RPS	2.16
4	7	2022-12-19	Logan Jacobsson	ljacobsson5@wufoo.com	United States	delivered	112.15	Amex	USPS	11.52
15	8	2019-06-05	Lilli Feares	lfeares6@shop-pro.jp	Mexico	pending	237.90	Discover	FedEX	4.48
16	9	2019-02-17	Lowrance Sigsworth	lsigsworth7@youtube.com	United States	delivered	141.94	Discover	USPS	7.31
5	10	2023-04-19	Libbi Spadari	lspadari8@dot.gov	Mexico	pending	160.79	Discover	RPS	16.52
6	11	2020-01-20	Renato Hue	rhue9@un.org	Canada	delivered	120.52	Visa	USPS	5.57
7	12	2022-03-03	Lucky Helstrip	lhelstripa@tmall.com	Mexico	delivered	202.07	Amex	UPS	18.57
8	13	2021-09-04	Debi Myrie	dmyrieb@unc.edu	United States	delivered	131.62	Amex	UPS	2.37
20	14	2022-02-27	Hyacinth Aveyard	haveyardc@ucoz.com	United States	pending	209.86	Amex	USPS	8.69
9	15	2019-01-11	Crin Blanket	cblanketd@newsvine.com	United States	delivered	85.46	Visa	UPS	14.22

Here, we just use the orderid to check for duplicates:

orders.drop_duplicates(subset="orderid").sort_values("orderid")

	orderid	orderdate	custname	custemail	custcountry	orderstatus	ordertotal	ordercreditcard	ordershipvia	shippingtotal
0	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
11	3	2020-02-23	Loy Siberry	lsiberry1@so-net.ne.jp	Canada	delivered	76.87	Discover	USPS	6.27
1	4	2022-04-28	Carree Henworth	NaN	Canada	pending	152.30	Discover	USPS	12.74
2	5	2019-11-22	Goldina Godsafe	ggodsafe3@dailymail.co.uk	United States	shipped	182.17	Amex	UPS	5.44
3	6	2022-05-03	Marris Chatten	mchatten4@csmonitor.com	Mexico	pending	208.28	Discover	RPS	2.16
4	7	2022-12-19	Logan Jacobsson	ljacobsson5@wufoo.com	United States	delivered	112.15	Amex	USPS	11.52
15	8	2019-06-05	Lilli Feares	lfeares6@shop-pro.jp	Mexico	pending	237.90	Discover	FedEX	4.48
16	9	2019-02-17	Lowrance Sigsworth	lsigsworth7@youtube.com	United States	delivered	141.94	Discover	USPS	7.31
5	10	2023-04-19	Libbi Spadari	lspadari8@dot.gov	Mexico	pending	160.79	Discover	RPS	16.52
6	11	2020-01-20	Renato Hue	rhue9@un.org	Canada	delivered	120.52	Visa	USPS	5.57
7	12	2022-03-03	Lucky Helstrip	lhelstripa@tmall.com	Mexico	delivered	202.07	Amex	UPS	18.57
8	13	2021-09-04	Debi Myrie	dmyrieb@unc.edu	United States	delivered	131.62	Amex	UPS	2.37
20	14	2022-02-27	Hyacinth Aveyard	haveyardc@ucoz.com	United States	pending	209.86	Amex	USPS	8.69
9	15	2019-01-11	Crin Blanket	cblanketd@newsvine.com	United States	delivered	85.46	Visa	UPS	14.22

That gives the same result as using the whole row because the orderid is one-to-one with the uniqueness of the row. It’s faster doing this though, even on this small dataset (use %timeit to compare)!

Notice if we had used a different subset which is not representative of the row, we lose data. For example:

orders.drop_duplicates(subset="ordercreditcard").sort_values("orderid")

	orderid	orderdate	custname	custemail	custcountry	orderstatus	ordertotal	ordercreditcard	ordershipvia	shippingtotal
0	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
2	5	2019-11-22	Goldina Godsafe	ggodsafe3@dailymail.co.uk	United States	shipped	182.17	Amex	UPS	5.44
6	11	2020-01-20	Renato Hue	rhue9@un.org	Canada	delivered	120.52	Visa	USPS	5.57

Returning a dataframe of duplicates

To get a dataframe of just the duplicated values you can use df.duplicated(). This returns a boolean series that you can use to extract the duplicated rows from the concatenated dataframe. Example:

dupes = orders.duplicated(subset=['orderid'])
orders[dupes]

	orderid	orderdate	custname	custemail	custcountry	orderstatus	ordertotal	ordercreditcard	ordershipvia	shippingtotal
10	2	2023-03-24	Frayda Pepperd	fpepperd0@sciencedaily.com	Canada	delivered	228.39	Discover	RPS	12.05
12	4	2022-04-28	Carree Henworth	NaN	Canada	pending	152.30	Discover	USPS	12.74
13	6	2022-05-03	Marris Chatten	mchatten4@csmonitor.com	Mexico	pending	208.28	Discover	RPS	2.16
14	7	2022-12-19	Logan Jacobsson	ljacobsson5@wufoo.com	United States	delivered	112.15	Amex	USPS	11.52
17	10	2023-04-19	Libbi Spadari	lspadari8@dot.gov	Mexico	pending	160.79	Discover	RPS	16.52
18	12	2022-03-03	Lucky Helstrip	lhelstripa@tmall.com	Mexico	delivered	202.07	Amex	UPS	18.57
19	13	2021-09-04	Debi Myrie	dmyrieb@unc.edu	United States	delivered	131.62	Amex	UPS	2.37

Merges

A merge combines two dataframes based on a common column. The resulting dataframe is wider (has more columns) than the original dataframe. The function to do this is pd.merge(). It’s most common arguments:

pd.merge(left: pd.DataFrame, right:pd.Dataframe, how:str, left_on:str, right_on:str)

how specifies the join operation:
- "inner" - returns ONLY rows when values of left_on match right_on
- "left" - returns ALL rows from left and ONLY rows from right when values of left_on match right_on
- "right" - returns ALL rows from right and ONLY rows from left when values of left_on match right_on
- "outer" - returns ALL rows from left and right and rows when values of left_on match right_on

To illustrate the differences between these, consider merging the following two dataframes, one representing a roster of basketball players, and another a list of teams:

bbplayers = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/bbplayers.csv")
bbplayers

	player_id	player_name	career_pts	player_team_id
0	101	Jordan	32292	1.0
1	102	Pippen	18940	1.0
2	103	Bryant	33643	2.0
3	104	O'Neal	28596	2.0
4	105	Fudge	0	NaN

bbteams = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/bbteams.csv")
bbteams

	team_id	team_name	team_location
0	1	Bulls	Chicago, IL
1	2	Lakers	Los Angeles, CA
2	3	Tropics	Flint, MI

Inner join: Only rows that match the bbplayer.player_team_id and bbteam.team_id and bbteam will be included. Note that in this case we lose Player 105 and team 3 because there are no matches:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='inner')
combined

	player_id	player_name	career_pts	player_team_id	team_id	team_name	team_location
0	101	Jordan	32292	1.0	1	Bulls	Chicago, IL
1	102	Pippen	18940	1.0	1	Bulls	Chicago, IL
2	103	Bryant	33643	2.0	2	Lakers	Los Angeles, CA
3	104	O'Neal	28596	2.0	2	Lakers	Los Angeles, CA

Left join: All rows from the left (every player) and ONLY rows from team that match. Now we see Player 105 despite no team match:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='left')
combined

	player_id	player_name	career_pts	player_team_id	team_id	team_name	team_location
0	101	Jordan	32292	1.0	1.0	Bulls	Chicago, IL
1	102	Pippen	18940	1.0	1.0	Bulls	Chicago, IL
2	103	Bryant	33643	2.0	2.0	Lakers	Los Angeles, CA
3	104	O'Neal	28596	2.0	2.0	Lakers	Los Angeles, CA
4	105	Fudge	0	NaN	NaN	NaN	NaN

Right join: All rows from the right (every team) and ONLY rows from players that match. We now see Team 3 despite no player match:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='right')
combined

	player_id	player_name	career_pts	player_team_id	team_id	team_name	team_location
0	101.0	Jordan	32292.0	1.0	1	Bulls	Chicago, IL
1	102.0	Pippen	18940.0	1.0	1	Bulls	Chicago, IL
2	103.0	Bryant	33643.0	2.0	2	Lakers	Los Angeles, CA
3	104.0	O'Neal	28596.0	2.0	2	Lakers	Los Angeles, CA
4	NaN	NaN	NaN	NaN	3	Tropics	Flint, MI

Outer join: All rows from both tables are included. This is equivalent to doing an inner join + all non-matching rows from both tables. In this case, we see Team 3 and Player 105:

combined = pd.merge(bbplayers, bbteams, left_on='player_team_id', right_on='team_id', how='outer')
combined

	player_id	player_name	career_pts	player_team_id	team_id	team_name	team_location
0	101.0	Jordan	32292.0	1.0	1.0	Bulls	Chicago, IL
1	102.0	Pippen	18940.0	1.0	1.0	Bulls	Chicago, IL
2	103.0	Bryant	33643.0	2.0	2.0	Lakers	Los Angeles, CA
3	104.0	O'Neal	28596.0	2.0	2.0	Lakers	Los Angeles, CA
4	NaN	NaN	NaN	NaN	3.0	Tropics	Flint, MI
5	105.0	Fudge	0.0	NaN	NaN	NaN	NaN

Code Challenge 4.4.2: who is not buying from MiniMart?

Consider the following data from a grocery store: https://github.com/mafudge/datasets/tree/master/minimart. In that directory you’ll see customers.csv, which is a list of customers, and a separate CSV file of purchases made in the first four months of the year.

You have been hired to build a UI to display names of customers who did not buy from minimart in any given month. Write a Streamlit app that displays a dataframe of customers who did not buy anything in a given month. The app should have a drop down menu that allows the user to select the month to display.

The URL for the location of raw data that you can use in your app is:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/minimart/

Solution

import streamlit as st
import pandas as pd

base = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/minimart/"
months = ['jan', 'feb', 'mar', 'apr']

st.title("Who's not Buying from MiniMart?")
month = st.selectbox('Select Month:', months)

purchases = pd.read_csv(f"{base}/purchases-{month}.csv")
customers = pd.read_csv(f"{base}/customers.csv")
combined = pd.merge(customers, purchases, left_on='customer_id', right_on='customer_id', how='left')
cols = ["customer_id", "firstname", "lastname"]
did_not_buy = combined["order_id"].isnull()
customers_who_did_not_buy = combined[did_not_buy][cols]
st.header(f"These people did not buy anything in {month.capitalize()}.:")
st.dataframe(customers_who_did_not_buy, hide_index=True)

# You can add the following to double check the results:
#st.divider()
#st.write("debug")
#st.dataframe(combined)
# That will display the full combined frame for the selected month. The ones
# without entries for the order id should be what's displayed in the table above.