3. Data I/O with Pandas

In this tutorial we’ll learn about how to load and save files using Pandas, including how to handle different file formats (csv, json, Excel, etc.).

Pandas reads data in a variety of formats

Examples:

  • Text: CSV / Delimited
    pd.read_csv("file.csv", sep=",", header=0)
  • Semi- Structured: JSON, HTML, XML
    pd.read_json("file.json", orient="records")
  • Microsoft Excel
    pd.read_excel("file.xlsx", sheet_name="Sheet 1")
  • Big Data formats (ORC, Parquet, HDF5)
    pd.read_parquet("file.parquet")
  • SQL Databases

For more details, see the Pandas IO documentation.

Pandas can read from almost anywhere

  • Local files
    pd_read_csv("./folder/file.csv")
  • Files over the network using http / https
    pd.read_csv("https://website/folder/file.csv")
  • File-like: binary / text streams
with open('file.csv', 'r') as file:
    data = file.read()
    df = pd.read_csv(pd.compat.StringIO(data))  # text stream

Reading CSV / Delimited Text

For reading CSV files (or text files with other delimiters, such as tab), we use the read_csv() function.

  • This function is for processing text files one record per line with values separated by a delimiter (typically a ,, but can be any string).
  • Common named arguments:
    • sep= the delimiter, default is a comma.
    • header= Which row, amongst those not skipped is the header
    • names= list of column names to use in the DataFrame
    • skiprows= how many lines to skip before the data begins?

Some examples of reading in the same data in different ways are below. In every case, the output is the same DataFrame:

import pandas as pd
# To view the following files, see:
# https://github.com/mafudge/datasets/tree/master/delimited
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited"

# Header is first row, Comma-delimited
students = pd.read_csv(f'{location}/students-header.csv') 

# No header in first row, Comma-delimited
students = pd.read_csv(f'{location}/students-no-header.csv', header=None, names =['Name','Grade','Year'])

# No header in first row, Pipe-delimited  "|"
students = pd.read_csv(f'{location}/students-header.psv', sep="|")

# Header not in first row, header in 6th row, Comma-delimited"
students = pd.read_csv(f'{location}/students-header-blanks.csv', skiprows=5)

# no header, data starts in 6th row, semicolon-delimited"
students = pd.read_csv(f'{location}/students-no-header-blanks.ssv', skiprows=5, header=None, sep=";", names =['Name','Grade','Year'])

students
Name Grade Year
0 Abby 7.0 Freshman
1 Bob 9.0 Sophomore
2 Chris 10.0 Senior
3 Dave 8.0 Freshman
4 Ellen 7.0 Sophomore
5 Fran 10.0 Senior
6 Greg 8.0 Freshman
7 Helen NaN Sophomore
8 Iris 10.0 Senior
9 Jimmy 8.0 Freshman
10 Karen 7.5 Freshman
11 Lynne 10.0 Sophomore
12 Mike 10.0 Sophomore
13 Nico NaN Junior
14 Pete 8.0 Freshman
CautionCode Challenge 4.3.1

Read this file into a Pandas DataFrame:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log

  • What is the delimiter?
  • Is there a header? Which row?
  • Do you need to skip lines?

Display only data where the time taken > 500 (msec) and the sc-status is equal to 200.

Bonus: display the data in a Streamlit app.

import pandas as pd

wt = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log", skiprows=3, header=0, sep="\s+")
wt.info() # colunmn info (to console only)

wt_filter = (wt['sc-status'] == 200) & ( wt['time-taken'] > 500)
wt_slow_but_successful = wt[wt_filter]
wt_slow_but_successful
<class 'pandas.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   date             489 non-null    str  
 1   time             489 non-null    str  
 2   s-ip             489 non-null    str  
 3   cs-method        489 non-null    str  
 4   cs-uri-stem      489 non-null    str  
 5   cs-uri-query     489 non-null    str  
 6   s-port           489 non-null    int64
 7   cs-username      489 non-null    str  
 8   c-ip             489 non-null    str  
 9   cs(User-Agent)   489 non-null    str  
 10  cs(Referer)      489 non-null    str  
 11  sc-status        489 non-null    int64
 12  sc-substatus     489 non-null    int64
 13  sc-win32-status  489 non-null    int64
 14  time-taken       489 non-null    int64
dtypes: int64(5), str(10)
memory usage: 57.4 KB
<>:3: SyntaxWarning: invalid escape sequence '\s'
<>:3: SyntaxWarning: invalid escape sequence '\s'
/tmp/ipykernel_10183/568368627.py:3: SyntaxWarning: invalid escape sequence '\s'
  wt = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log", skiprows=3, header=0, sep="\s+")
date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
32 2016-02-11 17:16:17 128.230.247.37 GET /desktops - 80 - 215.82.23.2 Mozilla/5.0+(Windows+NT+10.0;+WOW64;+rv:43.0)+... http://group0.ist722.ischool.syr.edu/ 200 0 0 1144
93 2016-02-11 17:16:59 128.230.247.37 GET /adidas-consortium-campus-80s-running-shoes - 80 - 128.122.140.238 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi... http://group0.ist722.ischool.syr.edu/shoes 200 0 0 701
105 2016-02-11 17:17:07 128.230.247.37 GET /htc-one-m8-android-l-50-lollipop - 80 - 128.122.140.238 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi... http://group0.ist722.ischool.syr.edu/ 200 0 0 613
158 2016-02-11 18:03:08 128.230.247.37 GET /digital-downloads - 80 - 74.111.6.173 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi... http://group0.ist722.ischool.syr.edu/register 200 0 0 572
184 2016-02-11 18:03:58 128.230.247.37 GET /jewelry - 80 - 74.111.6.173 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi... http://group0.ist722.ischool.syr.edu/cart 200 0 0 516
281 2016-02-11 18:07:48 128.230.247.37 GET /electronics - 80 - 172.189.252.8 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3... http://group0.ist722.ischool.syr.edu/ 200 0 0 704
293 2016-02-11 18:07:50 128.230.247.37 GET /apparel - 80 - 172.189.252.8 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3... http://group0.ist722.ischool.syr.edu/electronics 200 0 0 645
303 2016-02-11 18:07:53 128.230.247.37 GET /computers - 80 - 172.189.252.8 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3... http://group0.ist722.ischool.syr.edu/apparel 200 0 0 703
399 2016-02-11 19:55:38 128.230.247.37 GET /if-you-wait - 80 - 8.37.70.99 AddThis.com+(http://support.addthis.com/) - 200 0 0 539

To display in a Streamlit app, put the above in a script file (say, webtraffic.py) and add the following:

import streamlit as st

st.title("Webtraffic Data")
st.dataframe(wt_slow_but_successful) # first 20 rows

Then in a terminal run:

python -m streamlit run webtraffic.py

Reading JSON Text

To load JSON files as a Pandas DataFrame we use the read_json() function. Examples:

  • pd.read_json("file.json", orient="columns")

  • pd.read_json("file.json", orient="records", lines=True) <== Line-oriented json

Orientations: - split: dict like {index -> [index]; columns -> [columns]; data -> [values]}

For more on reading JSON files, see the Pandas Reading JSON guide.

Some examples of reading in the same JSON data in different ways follows. In every case, the output is the same DataFrame:

# https://github.com/mafudge/datasets/tree/master/json-formats to view the files
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-formats"

# Row-oriented JSON [ { "Name": "Alice", "Grade": 12, "Year": 2021 }, { "Name": "Bob", "Grade": 11, "Year": 2022 } ]
students = pd.read_json(f'{location}/students-records.json', orient='records')

# line-oriented JSON { "Name": "Alice", "Grade": 12, "Year": 2021 }\n { "Name": "Bob", "Grade": 11, "Year": 2022 }\n
students = pd.read_json(f'{location}/students-lines.json', orient='records', lines=True)

# column-oriented JSON { "Name": ["Alice", "Bob"], "Grade": [12, 11], "Year": [2021, 2022] }
students = pd.read_json(f'{location}/students-columns.json', orient='columns')

students
Name Grade Year
0 Abby 7.0 Freshman
1 Bob 9.0 Sophomore
2 Chris 10.0 Senior
3 Dave 8.0 Freshman
4 Ellen 7.0 Sophomore
5 Fran 10.0 Senior
6 Greg 8.0 Freshman
7 Helen NaN Sophomore
8 Iris 10.0 Senior
9 Jimmy 8.0 Freshman
10 Karen 7.5 Freshman
11 Lynne 10.0 Sophomore
12 Mike 10.0 Sophomore
13 Nico NaN Junior
14 Pete 8.0 Freshman

Handling Nested JSON

The read_json() method does not perform well on nested JSON structures. For example consider the following JSON file of customer orders:

The file orders.json:

[
    {
        "Customer" : { "FirstName" : "Abby", "LastName" : "Kuss"}, 
        "Items" : [
            { "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 3},
            { "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
        ]
    },
    {
        "Customer" : { "FirstName" : "Bette", "LastName" : "Alott"}, 
        "Items" : [
            { "Name" : "Shoes", "Price" : 25.0, "Quantity" : 1}, 
            { "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
        ]
    },
    {
        "Customer" : { "FirstName" : "Chris", "LastName" : "Peanugget"}, 
        "Items" : [
            { "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 1}
        ]
    }
]

When we read this with read_json() we get the three orders but only two columns — one for the "Customer" key, and the other for the "Items" key:

orders = pd.read_json("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
orders
Customer Items
0 {'FirstName': 'Abby', 'LastName': 'Kuss'} [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...
1 {'FirstName': 'Bette', 'LastName': 'Alott'} [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ...
2 {'FirstName': 'Chris', 'LastName': 'Peanugget'} [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...

What we want is one row per item on the the order and the customer name to be in separate columns. The json_normalize() function can help here.

Note

It is important to note that json_normalize() does not take a file as input, but rather de-serialized json (i.e., a dict or list of dicts).

An example (note that we first need to load the file as JSON dict; for that, we’ll use the requests module to download the data):

# first down load the data
import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
json_data = response.json()  #de-serialize
print('Original JSON data:')
print(json_data)
# now load into a DataFrame
orders = pd.json_normalize(json_data)
print("\nLoaded DataFrame:")
orders
Original JSON data:
[{'Customer': {'FirstName': 'Abby', 'LastName': 'Kuss'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 3}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Bette', 'LastName': 'Alott'}, 'Items': [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': 1}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Chris', 'LastName': 'Peanugget'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 1}]}]

Loaded DataFrame:
Items Customer.FirstName Customer.LastName
0 [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... Abby Kuss
1 [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ... Bette Alott
2 [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... Chris Peanugget

Better but this only processed nested dict and not nested list. We still need to handle the list of Items. To accomplish this we :

  1. Set the record_path to be the nested list 'Items'. This tells json_normalize() to use that JSON key as the row level. So now we will have 5 rows (one for each item) instead of 3.
  2. Then we set the meta named argument to a list of each of the other values we wish to include, in this instance last name and first name.

Note: The meta syntax is a bit weird. It’s a list of JSON paths (also represented as lists) to each item in the JSON. For example:

The meta Argument        ==> Matches This in the JSON           ==> And Displays As This Pandas Column
["Customer","FirstName"] ==> { "Customer" : { "FirstName": ...} ==> Customer.Firstname
orders = pd.json_normalize(json_data, record_path="Items", meta=[["Customer","FirstName"],["Customer","LastName"]])
orders
Name Price Quantity Customer.FirstName Customer.LastName
0 T-Shirt 10.0 3 Abby Kuss
1 Jacket 20.0 1 Abby Kuss
2 Shoes 25.0 1 Bette Alott
3 Jacket 20.0 1 Bette Alott
4 T-Shirt 10.0 1 Chris Peanugget

Yes it seems complicated, because conceptually it is a bit complicated. Let’s try another example, with some abstract values.

In the following example we want to generate a normalized table with 3 rows and 4 columns.

  • The rows are based on the "A" record_path, which has two sub-sets, A1 and A2. There are three sets of A data: (101, 102); (111, 112); and (201, 202).
  • The meta data are based on columns "B", and "C1"
json_data = [
    {
        "A": [
            {"A1": 101, "A2": 102},
            {"A1": 111, "A2": 112}
        ],
        "B": 103,
        "C": {"C1": 104}
    },
    {
        "A": [
            {"A1": 201, "A2": 202}
        ],
        "B": 203,
        "C": {"C1": 204}
    }
]

df = pd.json_normalize(json_data, record_path="A", meta=["B", ["C", "C1"]])
df
A1 A2 B C.C1
0 101 102 103 104
1 111 112 103 104
2 201 202 203 204
CautionCode Challenge 4.3.2

Use the json_normalize function to tabularize this JSON data:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees.json

The final table should have these columns: dept, age, firstname, lastname.

Hint: read the file using the requests module, like in the above example.

import requests
import pandas as pd

response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/employees.json")
employees = response.json()
employees_df = pd.json_normalize(employees, record_path=["employees"], meta=["dept"])
employees_df
firstName lastName age dept
0 John Doe 23 accounting
1 Mary Smith 32 accounting
2 Sally Green 27 sales
3 Jim Galley 41 sales

Reading Excel files

Excel files can be read using the read_excel() function. For example: pd.read_excel('file.xlsx', sheet_name=None)

Note

In order to use the read_excel method, you need to additional install the optional Pandas dependency openpyxl. To do that using pip, run:

pip install openpyxl

This will read in all sheets as a dict, with the sheet names as the keys and the values as Pandas DataFrames representing the contents. An example using this with Streamlit:

import streamlit as st
import pandas as pd

st.title("Excel Example - multiple sheets")

contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)

# names of sheets in the excel file its a dictionary
sheets = list(contents.keys()) 

# make tabs for each sheet
tabs = st.tabs(sheets)

#loop through each tab and write the contents of the sheet to the tab
for i in range(len(tabs)):
    df = contents[sheets[i]]
    tabs[i].dataframe(df)

Reading HTML Tables

You can scrape an HTML table off a webpage using the read_html() function. This will return a list of all HTML tables on the page, with each table as a DataFrame.

Note

In order to use the read_html method, you need to additional install the optional Pandas dependency lxml. To do that using pip, run:

pip install lxml 

For example, here’s an example of loading all the tables on the course website’s front page:

contents = pd.read_html("https://su-ist356-m003-spring-2026.github.io/course-home")
for df in contents:
    print(df)
   Week       Dates                                          Topic
0     1  1/12, 1/14                           Intro; CLI and Conda
1     2        1/21                                   Git & GitHub
2     3  1/26, 1/28                              Python review - 1
3     4    2/2, 2/4                              Python review - 2
4     5   2/9, 2/11       Writing scripts with Argparse; Streamlit
5     6  2/16, 2/18                             Data wrangling - 1
6     7  2/23, 2/25                             Data wrangling - 2
7     8    3/2, 3/4           Review (3/2) and Mid-term Exam (3/4)
8     –  3/8 - 3/15                                   Spring Break
9     9  3/16, 3/18                                   Web APIs - 1
10   10  3/23, 3/25                                   Web APIs - 2
11   11   3/30, 4/1                               Web scraping - 1
12   12    4/6, 4/8                               Web scraping - 2
13   13  4/13, 4/15                         Data visualization - 1
14   14  4/20, 4/22  Data visualization - 2; Review for Final Exam
15   15        4/27                                     Final Exam
16   16         5/4                                    Project due
         Date                                         Unnamed: 1
0   Mon. 1/12                                 First day of class
1   Mon. 1/19              Martin Luther King, Jr. Day: No class
2    Mon. 2/2  Academic/Financial drop deadline; Religious ob...
3  3/8 - 3/15                            Spring break - No class
4  Tues. 4/14                                Withdrawal deadline
5   Mon. 4/27                                  Last day of class

An example of turning this into a Streamlit app:

import streamlit as st
import pandas as pd

st.title("HTML Example - multiple tables")

url = "https://su-ist356-m003-spring-2026.github.io/course-home"
contents = pd.read_html(url)

# There are 2 tables on this page, but we don't know this
tables_count = len(contents)

st.write(f"Found {tables_count} tables on the page")

# make tabs for each HTML Table
tab_names = [ f"HTML Table {i}" for i in range(tables_count)]
tabs = st.tabs(tab_names)

# for each tab, show its table
for i in range(len(tabs)):
    df = contents[i]
    tabs[i].dataframe(df)

Writing Dataframes

  • Once the data is in a pd.DataFrame is can be written out with one of the to() methods such as to_csv(), to_json(), to_parquet() etc.
  • This makes pandas a superior data conversion tool.
  • If you include a file, the to() method writes to the file, otherwise the binary contents are returned.
  • https://pandas.pydata.org/pandas-docs/stable/reference/io.html

An example of converting the above Excel spreadsheet to CSV:

contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)
for sheetname, df in contents.items():
    # convert spaces to underscores for filenames
    sheetname = sheetname.replace(' ', '_')
    filename = f'books_of_interest-{sheetname}.csv'
    df.to_csv(filename, header=True, index=False)
CautionCode Challenge 4.3.3

Write a Streamlit app that will accept an Excel file via file uploader and then write out a record-oriented JSON file from the first tab in the excel file.

The program should display the contents of the dataframe and provide a download button for the converted the csv file. Hints: - To provide the ability to upload a file, see: st.file_uploader - To provide the ability to download a file, see: st.download_button

import requests
import pandas as pd
import streamlit as st

st.title("Excel to JSON")

uploaded_file = st.file_uploader("Upload an EXCEL file", type=["xlsx"])

if uploaded_file:
    df = pd.read_excel(uploaded_file.getvalue())
    st.dataframe(df)
    json_file = df.to_json(orient="records", index=False)
    json_filename = uploaded_file.name.replace('.xlsx', '.json')
    download = st.download_button(f"Download {json_filename}", data=json_file, file_name=json_filename)