5. Basic data cleaning with Pandas

In this lesson we will start learning how to clean a dataframe data and loop over it.

For this tutorial, we will be using some data representing a pretend restaurant’s transactions. The file is:

checks_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv'

This file is also provided in your class notebook repository, in data-wrangling/data/delimited.

Let’s load the file with Pandas and sample a few of its rows to see what it contains:

import pandas as pd

checks = pd.read_csv(checks_data)
checks.sample(10)

	check	date	party size	total items on check	total amount of check	gratuity
33	3842	2024-03-31	6	6	$147.12	$5.88
30	2705	2024-07-08	10	19	$838.85	$671.08
45	4031	2024-07-12	9	13	$129.74	$29.84
40	2512	2024-03-30	3	12	$181.56	$39.94
13	3867	2024-05-02	4	14	$499.10	$119.78
19	3718	2024-10-30	2	5	$464.70	$120.82
24	4310	2024-11-30	10	34	$3,262.30	$913.44
43	1186	2024-09-21	5	16	$298.72	$74.68
6	2527	2024-03-27	6	21	$921.48	$55.29
42	1361	2024-11-21	7	14	$65.80	$16.45

Let’s use the info method to get some more information about the columns:

checks.info()

<class 'pandas.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   check                  50 non-null     int64
 1   date                   50 non-null     str  
 2   party size             50 non-null     int64
 3   total items on check   50 non-null     int64
 4   total amount of check  50 non-null     str  
 5   gratuity               50 non-null     str  
dtypes: int64(3), str(3)
memory usage: 2.5 KB

There’s something odd here! Note that the data type of some of the columns (e.g., total amount of check) are object instead of floats, as you might expect. As we’ll see below, this is because of the $ in the values; that will cause issues when we try to work with these columns. We’ll learn how to “clean” these columns so that we can do useful things with them.

Apply

The apply method allows us to execute a function over a Series or the entire DataFrame.

The general syntax:

Series.apply(func) <== call function func for every item in the Series.
DataFrame.apply(func, axis=1) <== call function func for every row in the DataFrame (axis=1 => row).
DataFrame.apply(func, axis=0) <== call function func for every column in the DataFrame (axis=0 => col).

Note that the first argument apply is the function itself, not the function applied to some data. For example, suppose we define a function called sq that squares the input values:

def sq(x):
    return x**2

To apply this to one of the columns in our DataFrame (say, the party size column):

checks['party size'].apply(sq)

0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

Why `apply`?

In the above example, you might wonder why we don’t just apply the function directly to the Series, rather than use apply. Afterall, in the first pandas tutorial we learned that Series are vectorized just like numpy arrays. In other words, why not just do:

sq(checks['party size'])

0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

In this case, you could just run the function on the series. Where apply is useful is when you have more complicated functions, in particular, ones that need to do different things depending on what the input data is. For example, suppose we define the following function to group parties into small, medium, and large depending on how many people are in the party:

def classify_size(x):
    if x < 4:
        return 'small'
    elif x < 8:
        return 'medium'
    else:
        return 'large'

If we try to run this on the party size Series, we get an error:

classify_size(checks['party size'])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_10217/1113461677.py in ?()
----> 1 classify_size(checks['party size'])

/tmp/ipykernel_10217/2519828553.py in ?(x)
      1 def classify_size(x):
----> 2     if x < 4:
      3         return 'small'
      4     elif x < 8:
      5         return 'medium'

/opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/generic.py in ?(self)
   1511     @final
   1512     def __bool__(self) -> NoReturn:
-> 1513         raise ValueError(
   1514             f"The truth value of a {type(self).__name__} is ambiguous. "
   1515             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1516         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is because if statements cannot be vectorized like this. In contrast, the apply method does allow us to apply the function to the series:

checks['party size'].apply(classify_size)

0      large
1      small
2     medium
3      small
4     medium
5      small
6     medium
7      large
8      large
9      small
10    medium
11     large
12    medium
13    medium
14     small
15    medium
16    medium
17     small
18     large
19     small
20    medium
21     small
22     large
23     small
24     large
25    medium
26    medium
27     small
28    medium
29     small
30     large
31     small
32     small
33    medium
34     large
35    medium
36     small
37     large
38    medium
39    medium
40     small
41    medium
42    medium
43    medium
44    medium
45     large
46     small
47     small
48     large
49     large
Name: party size, dtype: str

This is because the apply method takes care to cycle over every element in the series and apply the function.

Cleaning data with `apply`

The ability of apply to apply more complicated functions involving if statements makes it extremely useful for cleaning datasets. By “cleaning” we mean reformatting data and/or removing spurious values, so that we can use it without issue.

For example, say we want to add price per item to our DataFrame, defined as:

price per item = total amount of check / total items on check

The problem is total amount of check is an object, not a float. This means we cannot do math on it:

# This will raise a TypeError because of the dollar sign and commas!!!
checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 # This will raise a TypeError because of the dollar sign and commas!!!
----> 2 checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:85, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     82     other = sanitize_array(other, None)
     83     other = ensure_wrapped_if_datetimelike(other)
---> 85 return method(self, other)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:214, in OpsMixin.__truediv__(self, other)
    212 @unpack_zerodim_and_defer("__truediv__")
    213 def __truediv__(self, other):
--> 214     return self._arith_method(other, operator.truediv)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/series.py:6751, in Series._arith_method(self, other, op)
   6749 def _arith_method(self, other, op):
   6750     self, other = self._align_for_op(other)
-> 6751     return base.IndexOpsMixin._arith_method(self, other, op)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/base.py:1644, in IndexOpsMixin._arith_method(self, other, op)
   1641     rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step)
   1643 with np.errstate(all="ignore"):
-> 1644     result = ops.arithmetic_op(lvalues, rvalues, op)
   1646 return self._construct_result(result, name=res_name, other=other)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:279, in arithmetic_op(left, right, op)
    270 right = ensure_wrapped_if_datetimelike(right)
    272 if (
    273     should_extension_dispatch(left, right)
    274     or isinstance(right, (Timedelta, BaseOffset, Timestamp))
   (...)    277     # Timedelta/Timestamp and other custom scalars are included in the check
    278     # because numexpr will fail on it, see GH#31457
--> 279     res_values = op(left, right)
    280 else:
    281     # TODO we should handle EAs consistently and move this check before the if/else
    282     # (https://github.com/pandas-dev/pandas/issues/41165)
    283     # error: Argument 2 to "_bool_arith_check" has incompatible type
    284     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
    285     _bool_arith_check(op, left, right)  # type: ignore[arg-type]

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:85, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     82     other = sanitize_array(other, None)
     83     other = ensure_wrapped_if_datetimelike(other)
---> 85 return method(self, other)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:214, in OpsMixin.__truediv__(self, other)
    212 @unpack_zerodim_and_defer("__truediv__")
    213 def __truediv__(self, other):
--> 214     return self._arith_method(other, operator.truediv)

File /opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/pandas/core/arrays/string_.py:1212, in StringArray._cmp_method(self, other, op)
   1210 result = np.empty_like(self._ndarray, dtype="object")
   1211 result[mask] = self.dtype.na_value
-> 1212 result[valid] = op(self._ndarray[valid], other)
   1213 if isinstance(other, Path):
   1214     # GH#61940
   1215     return result

TypeError: unsupported operand type(s) for /: 'str' and 'int'

How do we fix this? Let’s write a function to convert string values like this: $4,590.45 into floats like this: 4590.45

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))


# tests
assert clean_currency('$1,000.00') == 1000.00
assert clean_currency('$1,000') == 1000.00
assert clean_currency('1,000') == 1000.00
assert clean_currency('$1000') == 1000.00

With our function written we can use apply() to transform the series:

checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks.sample(10)

	check	date	party size	total items on check	total amount of check	gratuity	total_amount_of_check_cleaned	price_per_item
12	4528	2024-01-02	4	6	$49.98	$7.50	49.98	8.33
28	2446	2024-12-15	4	12	$575.64	$28.78	575.64	47.97
7	1564	2024-09-23	8	11	$928.40	$204.25	928.40	84.40
15	2386	2024-03-31	5	12	$1,147.80	$137.74	1147.80	95.65
48	4161	2024-06-22	9	28	$1,385.16	$235.48	1385.16	49.47
46	3621	2024-06-23	1	2	$138.76	$19.43	138.76	69.38
23	1194	2024-07-06	2	6	$453.06	$72.49	453.06	75.51
39	1707	2024-03-24	7	8	$341.44	$47.80	341.44	42.68
3	1957	2024-02-15	2	2	$42.44	$8.91	42.44	21.22
32	1440	2024-11-30	3	8	$589.04	$141.37	589.04	73.63

Note

Remember its a really good idea to track lineage when you are building a data pipeline.

NEVER replace columns, always create new ones.

Code Challenge 4.5.1

Let’s take what we did so far, and create a dataset that would be better prepared for analysis / machine learning. We’ll display it in a Streamlit app.

Create a script called st-cleaned_checks.py.
Add your clean_currency function to the script.
Load the checks dataset into a Pandas DataFrame and:
- Clean the total amount of check and gratuity columns.
- Calculate the price_per_item as total amount of check / total items on check.
- Calcualte the price_per_person as total amont of check / party size.
- Calcualte the items_per_person as total items on check / party size.
- Calcualte the tip_percentage as the total amount of check / gratuity.
Display the cleaned DataFrame along with a description of its data (i.e., display the output of describe) in a Streamlit app.
Test that your app works by running it.

Solution

The st-cleaned_checks.py script:

import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())

Test it by running the following in the terminal (make sure to activate your conda environment first):

python -m streamlit run st-cleaned_checks.py

Using `lambdas` to apply functions to multiple columns

So far we’ve used apply with functions that take in a single Series. What do we do if we need a function to operate on multiple columns in a DataFrame? For that, we can use Python lambda functions.

A simple example

Suppose we have the following dataset of grades for some students:

students = pd.DataFrame({'hw_avg': [85, 92, 99, 78],
                         'final_exam': [88, 95, None, 75]},
                        index=['Adam', 'Brenda', 'Charlie', 'Dina'])
students

	hw_avg	final_exam
Adam	85	88.0
Brenda	92	95.0
Charlie	99	NaN
Dina	78	75.0

We want to calculate their grade for the semester using the following function:

def grade(hw_avg, final_exam):
    if pd.isna(final_exam):
        return 'F'
    overall = 0.4*hw_avg + 0.6*final_exam
    if overall >= 90:
        return 'A'
    elif overall >= 80:
        return 'B'
    elif overall >= 70:
        return 'C'
    elif overall >= 60:
        return 'D'
    else:
        return 'F'

Since this function involves if statements, we need to use apply in order to run it on our students DataFrame. But grade needs two columns as input, hw_avg and final_exam. How do we do that?

We can use a lambda function to quickly define a small function that will return the required columns when provided a row. In general, the syntax for a lambda function is lambda ARGS: FUNC. For example, lambda a, b: a+b will return the sum of the two arguments its given.

In our case, we can use a lambda function to pull out the needed columns and give them to apply, like so:

students['grade'] = students.apply(lambda row: grade(row['hw_avg'], row['final_exam']),
                                   axis=1)
students

	hw_avg	final_exam	grade
Adam	85	88.0	B
Brenda	92	95.0	A
Charlie	99	NaN	F
Dina	78	75.0	C

An example using our checks data

Marketing wants you to build some key performance indicators (KPIs) using the checks data. A KPI is a statistic that summarizes some larger data set, so the data can be more easily tracked over time. For example a letter grade such as an A- is a KPI summary of all your graded efforts to date.

In this example, marketing wants you to build two KPIs:

KPI 1: Whales

Marketing has decided to group customers into the following categories:

big eaters: Customers who are in the top 25% (i.e., above the 0.75 quantile) for items per person.
big spenders: Customers who are in the top 25% for price per person.
whale: Customers who are in the top 25% for both items per person and price per person.

KPI 2: Tippers

Marketing has decided to further group customers into the following categories based on their tipping:

light: Customers who are in the botton 25% (i.e., below the 0.25 quantile) by tip percentage.
heavy: Customers who are in the top 25% (i.e., above the 0.75 quantile) by tip percentage.

To calculate percentiles we will use the quantile Series method in Pandas. This returns the value at which X% of the data is below the given percentile.

Before we can apply our KPI’s we must write the functions!

checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

To categorize whales:

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    return ''

Let’s test our function using the quantile method:

# tests
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
print(ppp_75, ipp_75)
assert detect_whale(5, 250, 3, 175) == 'whale'
assert detect_whale(5, 100, 3, 175) == 'big eater'
assert detect_whale(1, 250, 3, 175) == 'big spender'
assert detect_whale(1, 100, 3, 175) == ''

158.35666666666668 3.0

Now we want to apply the detect_whale function to the checks DataFrame. But detect_whale requires two columns as input, the items_per_person and price_per_person. Again, we need to use a lambda function, like so:

checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks.sample(25)

	check	date	party size	total items on check	total amount of check	gratuity	total_amount_of_check_cleaned	price_per_item	gratuity_cleaned	price_per_person	items_per_person	tip_percentage	whale
4	3010	2024-11-14	6	8	$758.16	$181.96	758.16	94.77	181.96	126.360000	1.333333	0.240002
47	3588	2024-08-20	1	2	$123.46	$6.17	123.46	61.73	6.17	123.460000	2.000000	0.049976
33	3842	2024-03-31	6	6	$147.12	$5.88	147.12	24.52	5.88	24.520000	1.000000	0.039967
20	3393	2024-08-26	5	6	$302.64	$24.21	302.64	50.44	24.21	60.528000	1.200000	0.079996
37	4829	2024-12-30	9	11	$816.20	$16.32	816.20	74.20	16.32	90.688889	1.222222	0.019995
1	2443	2024-06-09	3	10	$286.40	$31.50	286.40	28.64	31.50	95.466667	3.333333	0.109986	big eater
27	3653	2024-10-29	1	4	$72.88	$16.76	72.88	18.22	16.76	72.880000	4.000000	0.229967	big eater
18	2103	2024-02-24	8	9	$388.26	$116.48	388.26	43.14	116.48	48.532500	1.125000	0.300005
12	4528	2024-01-02	4	6	$49.98	$7.50	49.98	8.33	7.50	12.495000	1.500000	0.150060
38	2341	2024-06-03	7	16	$1,118.88	$313.29	1118.88	69.93	313.29	159.840000	2.285714	0.280003	big spender
43	1186	2024-09-21	5	16	$298.72	$74.68	298.72	18.67	74.68	59.744000	3.200000	0.250000	big eater
8	1066	2024-08-20	10	22	$485.76	$77.72	485.76	22.08	77.72	48.576000	2.200000	0.159997
35	2486	2024-01-27	4	13	$569.01	$108.11	569.01	43.77	108.11	142.252500	3.250000	0.189997	big eater
5	2191	2024-01-06	1	3	$17.85	$1.96	17.85	5.95	1.96	17.850000	3.000000	0.109804
31	1945	2024-02-05	3	7	$132.86	$21.26	132.86	18.98	21.26	44.286667	2.333333	0.160018
48	4161	2024-06-22	9	28	$1,385.16	$235.48	1385.16	49.47	235.48	153.906667	3.111111	0.170002	big eater
2	3685	2024-12-07	5	5	$252.95	$50.59	252.95	50.59	50.59	50.590000	1.000000	0.200000
22	1336	2024-08-30	8	28	$1,199.80	$275.95	1199.80	42.85	275.95	149.975000	3.500000	0.229997	big eater
39	1707	2024-03-24	7	8	$341.44	$47.80	341.44	42.68	47.80	48.777143	1.142857	0.139995
49	3404	2024-07-19	9	26	$2,382.90	$71.49	2382.90	91.65	71.49	264.766667	2.888889	0.030001	big spender
17	3795	2024-02-21	3	7	$212.38	$46.72	212.38	30.34	46.72	70.793333	2.333333	0.219983
11	3693	2024-01-18	10	20	$1,820.00	$309.40	1820.00	91.00	309.40	182.000000	2.000000	0.170000	big spender
28	2446	2024-12-15	4	12	$575.64	$28.78	575.64	47.97	28.78	143.910000	3.000000	0.049997
40	2512	2024-03-30	3	12	$181.56	$39.94	181.56	15.13	39.94	60.520000	4.000000	0.219982	big eater
29	4590	2024-05-08	3	5	$220.40	$22.04	220.40	44.08	22.04	73.466667	1.666667	0.100000

Code Challenge 4.5.2

Let’s build on the Streamlit app you made in Challenge 4.5.1, and add our KPIs to it.

Copy st-cleaned_checks.py to a new file called st-checks_kpi.py.
Add the detect_whale function defined above to st-checks_kpi.py.
Write a new function detect_tipper(tip_pct, tip_pcy_75th_pctile, tip_pct_25_pctile) that should return either “light” or “heavy” depending on whether the tip is below the 0.25 quantile or above the 0.75 quantile, respectively. If neither, the function should just return an empty string.
Apply the detect_whale and detect_tipper functions to the DataFrame, storing the results in new columns called whale and tipper, respectively.
As with before, diplay the modified DataFrame and the output of describe in a Streamlit app.

Solution

It’s easiest to do the copy using the command line. In the terminal, run:

cp st-cleaned_checks.py st-checks_kpi.py

Now modify st-checks_kpi.py. It should look like:

import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    
    return ''


def detect_tipper(tip_pct:float, tip_pct_75th_pctile:float, tip_pct_25th_pctile:float) -> str:
    if tip_pct > tip_pct_75th_pctile:
        return 'heavy tipper'
    if tip_pct < tip_pct_25th_pctile:
        return 'light tipper'
    return ''

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

## The new stuff adding KPIs:
# get KPI boundaries
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
tp_75 = checks['tip_percentage'].quantile(0.75)
tp_25 = checks['tip_percentage'].quantile(0.25)

# Calcualte KPI's
checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks['tipper'] = checks.apply(lambda row: detect_tipper(row['tip_percentage'], tp_75, tp_25), axis=1)

# Now display
st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())

Looping over Dataframes

If you must run a for loop over your DataFrames, there are two choices:

df.iterrows() dict-like iteration
df.itertuples() named-tuple like iteration (faster)

Let’s do an example where we display the check number, whale and tipper for “heavy tipper” checks.

## Using the iterrows() method
print("Total Amount of Whale Checks")
for i,row in checks.iterrows():
    if row['whale'] == 'whale':
        print(i, row['check'], row['total_amount_of_check_cleaned'])

# Same example with the itertuples() method
print("Total Amount of Whale Checks")
for row in checks.itertuples():
    if row.whale == 'whale':
        print(row.check, row.total_amount_of_check_cleaned)

# Of course you don't need a loop to do this:
checks[checks['whale'] == 'whale'][['check', 'total_amount_of_check_cleaned']]

Apply

Why apply?

Cleaning data with apply

Using lambdas to apply functions to multiple columns

A simple example

An example using our checks data

Looping over Dataframes

Why `apply`?

Cleaning data with `apply`

Using `lambdas` to apply functions to multiple columns