checks_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv'5. Basic data cleaning with Pandas
In this lesson we will start learning how to clean a dataframe data and loop over it.
For this tutorial, we will be using some data representing a pretend restaurant’s transactions. The file is:
This file is also provided in your class notebook repository, in data-wrangling/data/delimited.
Let’s load the file with Pandas and sample a few of its rows to see what it contains:
import pandas as pd
checks = pd.read_csv(checks_data)
checks.sample(10)| check | date | party size | total items on check | total amount of check | gratuity | |
|---|---|---|---|---|---|---|
| 33 | 3842 | 2024-03-31 | 6 | 6 | $147.12 | $5.88 |
| 30 | 2705 | 2024-07-08 | 10 | 19 | $838.85 | $671.08 |
| 45 | 4031 | 2024-07-12 | 9 | 13 | $129.74 | $29.84 |
| 40 | 2512 | 2024-03-30 | 3 | 12 | $181.56 | $39.94 |
| 13 | 3867 | 2024-05-02 | 4 | 14 | $499.10 | $119.78 |
| 19 | 3718 | 2024-10-30 | 2 | 5 | $464.70 | $120.82 |
| 24 | 4310 | 2024-11-30 | 10 | 34 | $3,262.30 | $913.44 |
| 43 | 1186 | 2024-09-21 | 5 | 16 | $298.72 | $74.68 |
| 6 | 2527 | 2024-03-27 | 6 | 21 | $921.48 | $55.29 |
| 42 | 1361 | 2024-11-21 | 7 | 14 | $65.80 | $16.45 |
Let’s use the info method to get some more information about the columns:
checks.info()<class 'pandas.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 check 50 non-null int64
1 date 50 non-null str
2 party size 50 non-null int64
3 total items on check 50 non-null int64
4 total amount of check 50 non-null str
5 gratuity 50 non-null str
dtypes: int64(3), str(3)
memory usage: 2.5 KB
There’s something odd here! Note that the data type of some of the columns (e.g., total amount of check) are object instead of floats, as you might expect. As we’ll see below, this is because of the $ in the values; that will cause issues when we try to work with these columns. We’ll learn how to “clean” these columns so that we can do useful things with them.
Apply
The apply method allows us to execute a function over a Series or the entire DataFrame.
The general syntax:
Series.apply(func)<== call functionfuncfor every item in the Series.DataFrame.apply(func, axis=1)<== call functionfuncfor every row in the DataFrame (axis=1=> row).DataFrame.apply(func, axis=0)<== call functionfuncfor every column in the DataFrame (axis=0=> col).
Note that the first argument apply is the function itself, not the function applied to some data. For example, suppose we define a function called sq that squares the input values:
def sq(x):
return x**2To apply this to one of the columns in our DataFrame (say, the party size column):
checks['party size'].apply(sq)0 64
1 9
2 25
3 4
4 36
5 1
6 36
7 64
8 100
9 1
10 36
11 100
12 16
13 16
14 1
15 25
16 25
17 9
18 64
19 4
20 25
21 1
22 64
23 4
24 100
25 36
26 36
27 1
28 16
29 9
30 100
31 9
32 9
33 36
34 100
35 16
36 4
37 81
38 49
39 49
40 9
41 49
42 49
43 25
44 49
45 81
46 1
47 1
48 81
49 81
Name: party size, dtype: int64
Why apply?
In the above example, you might wonder why we don’t just apply the function directly to the Series, rather than use apply. Afterall, in the first pandas tutorial we learned that Series are vectorized just like numpy arrays. In other words, why not just do:
sq(checks['party size'])0 64
1 9
2 25
3 4
4 36
5 1
6 36
7 64
8 100
9 1
10 36
11 100
12 16
13 16
14 1
15 25
16 25
17 9
18 64
19 4
20 25
21 1
22 64
23 4
24 100
25 36
26 36
27 1
28 16
29 9
30 100
31 9
32 9
33 36
34 100
35 16
36 4
37 81
38 49
39 49
40 9
41 49
42 49
43 25
44 49
45 81
46 1
47 1
48 81
49 81
Name: party size, dtype: int64
In this case, you could just run the function on the series. Where apply is useful is when you have more complicated functions, in particular, ones that need to do different things depending on what the input data is. For example, suppose we define the following function to group parties into small, medium, and large depending on how many people are in the party:
def classify_size(x):
if x < 4:
return 'small'
elif x < 8:
return 'medium'
else:
return 'large'If we try to run this on the party size Series, we get an error:
classify_size(checks['party size'])--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /tmp/ipykernel_10217/1113461677.py in ?() ----> 1 classify_size(checks['party size']) /tmp/ipykernel_10217/2519828553.py in ?(x) 1 def classify_size(x): ----> 2 if x < 4: 3 return 'small' 4 elif x < 8: 5 return 'medium' /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/generic.py in ?(self) 1511 @final 1512 def __bool__(self) -> NoReturn: -> 1513 raise ValueError( 1514 f"The truth value of a {type(self).__name__} is ambiguous. " 1515 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." 1516 ) ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is because if statements cannot be vectorized like this. In contrast, the apply method does allow us to apply the function to the series:
checks['party size'].apply(classify_size)0 large
1 small
2 medium
3 small
4 medium
5 small
6 medium
7 large
8 large
9 small
10 medium
11 large
12 medium
13 medium
14 small
15 medium
16 medium
17 small
18 large
19 small
20 medium
21 small
22 large
23 small
24 large
25 medium
26 medium
27 small
28 medium
29 small
30 large
31 small
32 small
33 medium
34 large
35 medium
36 small
37 large
38 medium
39 medium
40 small
41 medium
42 medium
43 medium
44 medium
45 large
46 small
47 small
48 large
49 large
Name: party size, dtype: str
This is because the apply method takes care to cycle over every element in the series and apply the function.
Cleaning data with apply
The ability of apply to apply more complicated functions involving if statements makes it extremely useful for cleaning datasets. By “cleaning” we mean reformatting data and/or removing spurious values, so that we can use it without issue.
For example, say we want to add price per item to our DataFrame, defined as:
price per item = total amount of check / total items on check
The problem is total amount of check is an object, not a float. This means we cannot do math on it:
# This will raise a TypeError because of the dollar sign and commas!!!
checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[10], line 2 1 # This will raise a TypeError because of the dollar sign and commas!!! ----> 2 checks['price_per_item'] = checks['total amount of check'] / checks['total items on check'] File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:85, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 82 other = sanitize_array(other, None) 83 other = ensure_wrapped_if_datetimelike(other) ---> 85 return method(self, other) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:214, in OpsMixin.__truediv__(self, other) 212 @unpack_zerodim_and_defer("__truediv__") 213 def __truediv__(self, other): --> 214 return self._arith_method(other, operator.truediv) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/series.py:6751, in Series._arith_method(self, other, op) 6749 def _arith_method(self, other, op): 6750 self, other = self._align_for_op(other) -> 6751 return base.IndexOpsMixin._arith_method(self, other, op) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/base.py:1644, in IndexOpsMixin._arith_method(self, other, op) 1641 rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step) 1643 with np.errstate(all="ignore"): -> 1644 result = ops.arithmetic_op(lvalues, rvalues, op) 1646 return self._construct_result(result, name=res_name, other=other) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:279, in arithmetic_op(left, right, op) 270 right = ensure_wrapped_if_datetimelike(right) 272 if ( 273 should_extension_dispatch(left, right) 274 or isinstance(right, (Timedelta, BaseOffset, Timestamp)) (...) 277 # Timedelta/Timestamp and other custom scalars are included in the check 278 # because numexpr will fail on it, see GH#31457 --> 279 res_values = op(left, right) 280 else: 281 # TODO we should handle EAs consistently and move this check before the if/else 282 # (https://github.com/pandas-dev/pandas/issues/41165) 283 # error: Argument 2 to "_bool_arith_check" has incompatible type 284 # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]" 285 _bool_arith_check(op, left, right) # type: ignore[arg-type] File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:85, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 82 other = sanitize_array(other, None) 83 other = ensure_wrapped_if_datetimelike(other) ---> 85 return method(self, other) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:214, in OpsMixin.__truediv__(self, other) 212 @unpack_zerodim_and_defer("__truediv__") 213 def __truediv__(self, other): --> 214 return self._arith_method(other, operator.truediv) File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arrays/string_.py:1212, in StringArray._cmp_method(self, other, op) 1210 result = np.empty_like(self._ndarray, dtype="object") 1211 result[mask] = self.dtype.na_value -> 1212 result[valid] = op(self._ndarray[valid], other) 1213 if isinstance(other, Path): 1214 # GH#61940 1215 return result TypeError: unsupported operand type(s) for /: 'str' and 'int'
How do we fix this? Let’s write a function to convert string values like this: $4,590.45 into floats like this: 4590.45
def clean_currency(value:str) -> float:
'''
This function will take a string value and remove the dollar sign and commas
and return a float value.
'''
return float(value.replace(',', '').replace('$', ''))
# tests
assert clean_currency('$1,000.00') == 1000.00
assert clean_currency('$1,000') == 1000.00
assert clean_currency('1,000') == 1000.00
assert clean_currency('$1000') == 1000.00With our function written we can use apply() to transform the series:
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks.sample(10)| check | date | party size | total items on check | total amount of check | gratuity | total_amount_of_check_cleaned | price_per_item | |
|---|---|---|---|---|---|---|---|---|
| 12 | 4528 | 2024-01-02 | 4 | 6 | $49.98 | $7.50 | 49.98 | 8.33 |
| 28 | 2446 | 2024-12-15 | 4 | 12 | $575.64 | $28.78 | 575.64 | 47.97 |
| 7 | 1564 | 2024-09-23 | 8 | 11 | $928.40 | $204.25 | 928.40 | 84.40 |
| 15 | 2386 | 2024-03-31 | 5 | 12 | $1,147.80 | $137.74 | 1147.80 | 95.65 |
| 48 | 4161 | 2024-06-22 | 9 | 28 | $1,385.16 | $235.48 | 1385.16 | 49.47 |
| 46 | 3621 | 2024-06-23 | 1 | 2 | $138.76 | $19.43 | 138.76 | 69.38 |
| 23 | 1194 | 2024-07-06 | 2 | 6 | $453.06 | $72.49 | 453.06 | 75.51 |
| 39 | 1707 | 2024-03-24 | 7 | 8 | $341.44 | $47.80 | 341.44 | 42.68 |
| 3 | 1957 | 2024-02-15 | 2 | 2 | $42.44 | $8.91 | 42.44 | 21.22 |
| 32 | 1440 | 2024-11-30 | 3 | 8 | $589.04 | $141.37 | 589.04 | 73.63 |
Remember its a really good idea to track lineage when you are building a data pipeline.
NEVER replace columns, always create new ones.
Using lambdas to apply functions to multiple columns
So far we’ve used apply with functions that take in a single Series. What do we do if we need a function to operate on multiple columns in a DataFrame? For that, we can use Python lambda functions.
A simple example
Suppose we have the following dataset of grades for some students:
students = pd.DataFrame({'hw_avg': [85, 92, 99, 78],
'final_exam': [88, 95, None, 75]},
index=['Adam', 'Brenda', 'Charlie', 'Dina'])
students| hw_avg | final_exam | |
|---|---|---|
| Adam | 85 | 88.0 |
| Brenda | 92 | 95.0 |
| Charlie | 99 | NaN |
| Dina | 78 | 75.0 |
We want to calculate their grade for the semester using the following function:
def grade(hw_avg, final_exam):
if pd.isna(final_exam):
return 'F'
overall = 0.4*hw_avg + 0.6*final_exam
if overall >= 90:
return 'A'
elif overall >= 80:
return 'B'
elif overall >= 70:
return 'C'
elif overall >= 60:
return 'D'
else:
return 'F'Since this function involves if statements, we need to use apply in order to run it on our students DataFrame. But grade needs two columns as input, hw_avg and final_exam. How do we do that?
We can use a lambda function to quickly define a small function that will return the required columns when provided a row. In general, the syntax for a lambda function is lambda ARGS: FUNC. For example, lambda a, b: a+b will return the sum of the two arguments its given.
In our case, we can use a lambda function to pull out the needed columns and give them to apply, like so:
students['grade'] = students.apply(lambda row: grade(row['hw_avg'], row['final_exam']),
axis=1)
students| hw_avg | final_exam | grade | |
|---|---|---|---|
| Adam | 85 | 88.0 | B |
| Brenda | 92 | 95.0 | A |
| Charlie | 99 | NaN | F |
| Dina | 78 | 75.0 | C |
An example using our checks data
Marketing wants you to build some key performance indicators (KPIs) using the checks data. A KPI is a statistic that summarizes some larger data set, so the data can be more easily tracked over time. For example a letter grade such as an A- is a KPI summary of all your graded efforts to date.
In this example, marketing wants you to build two KPIs:
KPI 1: Whales
Marketing has decided to group customers into the following categories:
big eaters: Customers who are in the top 25% (i.e., above the 0.75 quantile) for items per person.big spenders: Customers who are in the top 25% for price per person.whale: Customers who are in the top 25% for both items per person and price per person.
KPI 2: Tippers
Marketing has decided to further group customers into the following categories based on their tipping:
light: Customers who are in the botton 25% (i.e., below the 0.25 quantile) by tip percentage.heavy: Customers who are in the top 25% (i.e., above the 0.75 quantile) by tip percentage.
To calculate percentiles we will use the quantile Series method in Pandas. This returns the value at which X% of the data is below the given percentile.
Before we can apply our KPI’s we must write the functions!
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']To categorize whales:
def detect_whale(
items_per_person:float,
price_per_person:float,
items_per_person_75th_pctile:float,
price_per_person_75_pctile:float) -> str:
if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
return 'whale'
if items_per_person > items_per_person_75th_pctile:
return 'big eater'
if price_per_person > price_per_person_75_pctile:
return 'big spender'
return ''Let’s test our function using the quantile method:
# tests
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
print(ppp_75, ipp_75)
assert detect_whale(5, 250, 3, 175) == 'whale'
assert detect_whale(5, 100, 3, 175) == 'big eater'
assert detect_whale(1, 250, 3, 175) == 'big spender'
assert detect_whale(1, 100, 3, 175) == ''158.35666666666668 3.0
Now we want to apply the detect_whale function to the checks DataFrame. But detect_whale requires two columns as input, the items_per_person and price_per_person. Again, we need to use a lambda function, like so:
checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks.sample(25)| check | date | party size | total items on check | total amount of check | gratuity | total_amount_of_check_cleaned | price_per_item | gratuity_cleaned | price_per_person | items_per_person | tip_percentage | whale | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 3010 | 2024-11-14 | 6 | 8 | $758.16 | $181.96 | 758.16 | 94.77 | 181.96 | 126.360000 | 1.333333 | 0.240002 | |
| 47 | 3588 | 2024-08-20 | 1 | 2 | $123.46 | $6.17 | 123.46 | 61.73 | 6.17 | 123.460000 | 2.000000 | 0.049976 | |
| 33 | 3842 | 2024-03-31 | 6 | 6 | $147.12 | $5.88 | 147.12 | 24.52 | 5.88 | 24.520000 | 1.000000 | 0.039967 | |
| 20 | 3393 | 2024-08-26 | 5 | 6 | $302.64 | $24.21 | 302.64 | 50.44 | 24.21 | 60.528000 | 1.200000 | 0.079996 | |
| 37 | 4829 | 2024-12-30 | 9 | 11 | $816.20 | $16.32 | 816.20 | 74.20 | 16.32 | 90.688889 | 1.222222 | 0.019995 | |
| 1 | 2443 | 2024-06-09 | 3 | 10 | $286.40 | $31.50 | 286.40 | 28.64 | 31.50 | 95.466667 | 3.333333 | 0.109986 | big eater |
| 27 | 3653 | 2024-10-29 | 1 | 4 | $72.88 | $16.76 | 72.88 | 18.22 | 16.76 | 72.880000 | 4.000000 | 0.229967 | big eater |
| 18 | 2103 | 2024-02-24 | 8 | 9 | $388.26 | $116.48 | 388.26 | 43.14 | 116.48 | 48.532500 | 1.125000 | 0.300005 | |
| 12 | 4528 | 2024-01-02 | 4 | 6 | $49.98 | $7.50 | 49.98 | 8.33 | 7.50 | 12.495000 | 1.500000 | 0.150060 | |
| 38 | 2341 | 2024-06-03 | 7 | 16 | $1,118.88 | $313.29 | 1118.88 | 69.93 | 313.29 | 159.840000 | 2.285714 | 0.280003 | big spender |
| 43 | 1186 | 2024-09-21 | 5 | 16 | $298.72 | $74.68 | 298.72 | 18.67 | 74.68 | 59.744000 | 3.200000 | 0.250000 | big eater |
| 8 | 1066 | 2024-08-20 | 10 | 22 | $485.76 | $77.72 | 485.76 | 22.08 | 77.72 | 48.576000 | 2.200000 | 0.159997 | |
| 35 | 2486 | 2024-01-27 | 4 | 13 | $569.01 | $108.11 | 569.01 | 43.77 | 108.11 | 142.252500 | 3.250000 | 0.189997 | big eater |
| 5 | 2191 | 2024-01-06 | 1 | 3 | $17.85 | $1.96 | 17.85 | 5.95 | 1.96 | 17.850000 | 3.000000 | 0.109804 | |
| 31 | 1945 | 2024-02-05 | 3 | 7 | $132.86 | $21.26 | 132.86 | 18.98 | 21.26 | 44.286667 | 2.333333 | 0.160018 | |
| 48 | 4161 | 2024-06-22 | 9 | 28 | $1,385.16 | $235.48 | 1385.16 | 49.47 | 235.48 | 153.906667 | 3.111111 | 0.170002 | big eater |
| 2 | 3685 | 2024-12-07 | 5 | 5 | $252.95 | $50.59 | 252.95 | 50.59 | 50.59 | 50.590000 | 1.000000 | 0.200000 | |
| 22 | 1336 | 2024-08-30 | 8 | 28 | $1,199.80 | $275.95 | 1199.80 | 42.85 | 275.95 | 149.975000 | 3.500000 | 0.229997 | big eater |
| 39 | 1707 | 2024-03-24 | 7 | 8 | $341.44 | $47.80 | 341.44 | 42.68 | 47.80 | 48.777143 | 1.142857 | 0.139995 | |
| 49 | 3404 | 2024-07-19 | 9 | 26 | $2,382.90 | $71.49 | 2382.90 | 91.65 | 71.49 | 264.766667 | 2.888889 | 0.030001 | big spender |
| 17 | 3795 | 2024-02-21 | 3 | 7 | $212.38 | $46.72 | 212.38 | 30.34 | 46.72 | 70.793333 | 2.333333 | 0.219983 | |
| 11 | 3693 | 2024-01-18 | 10 | 20 | $1,820.00 | $309.40 | 1820.00 | 91.00 | 309.40 | 182.000000 | 2.000000 | 0.170000 | big spender |
| 28 | 2446 | 2024-12-15 | 4 | 12 | $575.64 | $28.78 | 575.64 | 47.97 | 28.78 | 143.910000 | 3.000000 | 0.049997 | |
| 40 | 2512 | 2024-03-30 | 3 | 12 | $181.56 | $39.94 | 181.56 | 15.13 | 39.94 | 60.520000 | 4.000000 | 0.219982 | big eater |
| 29 | 4590 | 2024-05-08 | 3 | 5 | $220.40 | $22.04 | 220.40 | 44.08 | 22.04 | 73.466667 | 1.666667 | 0.100000 |
Looping over Dataframes
If you must run a for loop over your DataFrames, there are two choices:
df.iterrows()dict-like iterationdf.itertuples()named-tuple like iteration (faster)
Let’s do an example where we display the check number, whale and tipper for “heavy tipper” checks.
## Using the iterrows() method
print("Total Amount of Whale Checks")
for i,row in checks.iterrows():
if row['whale'] == 'whale':
print(i, row['check'], row['total_amount_of_check_cleaned'])# Same example with the itertuples() method
print("Total Amount of Whale Checks")
for row in checks.itertuples():
if row.whale == 'whale':
print(row.check, row.total_amount_of_check_cleaned)# Of course you don't need a loop to do this:
checks[checks['whale'] == 'whale'][['check', 'total_amount_of_check_cleaned']]