2. More Scraping + actionability

Using locators to find elements on the page is a fundamental part of web scraping. In this notebook, we’ll learn how to use Playwright to find elements on the page using different types of locators.

Note

In this tutorial we dive further into using CSS selectors to scrape elements from a website. Our focus here is on how to do this with Playwright in Python. Getting good at web scraping requires more in-depth use of CSS selectors than what we cover here. A good tutorial on the various CSS selectors and how to query them when web scraping can be found here:

ScrapingBee: Using CSS Selectors for Web Scraping

Scraping HTML tables

We saw previously that it’s easy to scrape HTML tables into a pandas dataframe using pd.read_html.

Previously, we gave read_html a URL. However, you can also give read_html the HTML of a page as loaded by Playwright (via page.content()). Here’s an example:

# pw-pdtable.py

from io import StringIO
from playwright.sync_api import Playwright, sync_playwright
import pandas as pd

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://su-ist356-m003-spring-2026.github.io/course-home/syllabus.html")

    # ---------------------
    # use pandas read_html to parse the HTML
    # get a list of all tables on the page
    dfs = pd.read_html(StringIO(page.content()))

    # print the first table
    print(dfs[0])
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

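Scraping the next adjacent element

Sometimes one selector finds an element, but what we actually want to scrape is the element right after it.

To find the next adjacent sibling element, you use: .query_selector('~ *'). The selector ~ * matches every sibling that follows the element, and query_selector returns only the first match, which is the next sibling.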

Here’s an example of using this to select the first element in the Course Info section of the course syllabus (use the Inspect tool in your web browser to understand what’s going on here):

# pw-scrape_next_example.py
from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://su-ist356-m003-spring-2026.github.io/course-home/syllabus.html")
    
    # ---------------------
    # Let's get course info from the syllabus
    info_details = page.query_selector("section#course-info > h2.anchored")
    # Note: we could have alternatively selected the entire course
    # section, then pulled out the h2 element, like this:
    #course_info = page.query_selector("section#course-info")
    #info_details = course_info.query_selector("h2.anchored")
    # But the first way is more direct, as we only need the h2 element.
    print(info_details.inner_text())
    next_element = info_details.query_selector('~ *')
    print(next_element.inner_text())


    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

Using the next selector is useful if the information you’re trying to scrape doesn’t have a unique identifier, but something preceding it (like a section header) does. It’s also useful if you’re not sure whether, or don’t want to assume that, the website has a particular CSS selector for the information you want to retrieve.

Code Challenge 6.2.1

Scrape all the course info from the course syllabus:

https://su-ist356-m003-spring-2026.github.io/course-home/syllabus.html

Don’t assume any CSS identifier for the course info; just step through the section until you have retrieved it all.

Hint: Running query_selector('~ *') on the last child element in a parent element (like a section) will return None.

from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://su-ist356-m003-spring-2026.github.io/course-home/syllabus.html")
    
    # ---------------------
    # Let's get course info from the syllabus
    info_details = page.query_selector("section#course-info > h2.anchored")
    print(info_details.inner_text())
    next_element = info_details.query_selector('~ *')
    # The following while loop will continue until next_element is None. That
    # will happen once we've retrieved all available info from the section.
    while next_element:
        print(next_element.inner_text())
        next_element = next_element.query_selector('~ *')
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
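The same stepping pattern can be packaged as a small generator, so the text is collected into a list rather than printed. This is only a sketch: iter_following_siblings and section_texts are hypothetical helper names, and the code relies on the same query_selector('~ *') call used in the solution above. The heading argument is assumed to be the element returned by page.query_selector.

```python
from typing import Iterator


def iter_following_siblings(element) -> Iterator:
    """Yield each sibling that follows `element`, in document order.

    `element` is assumed to be a Playwright element handle (anything with
    a query_selector method works). Stops when query_selector returns None.
    """
    current = element.query_selector("~ *")
    while current is not None:
        yield current
        current = current.query_selector("~ *")


def section_texts(heading) -> list:
    """Return the heading text followed by the text of every later sibling."""
    texts = [heading.inner_text()]
    for sibling in iter_following_siblings(heading):
        texts.append(sibling.inner_text())
    return texts
```

Inside run, this could be called as section_texts(info_details) after the query_selector line that finds the h2 element.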

Downloading an Image

You can use Playwright to download an image by reading the src attribute of its img element and then fetching that URL.

Here’s an example in which we download the Syracuse University logo from the university’s website:

Note: the downloaded file is an SVG file.

# pw-image_example.py

from playwright.sync_api import Playwright, sync_playwright
import requests

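def download_image(url):
    # use the last part of the URL as the local filename
    filename = url.split("/")[-1]
    response = requests.get(url)
    # stop early if the server returned an error status
    response.raise_for_status()
    with open(filename, 'wb') as file:
        file.write(response.content)
    return filename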

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    site = "https://www.syracuse.edu/"
    page.goto(site)
    # ---------------------
    image = page.query_selector("a.site-header-logo-link > img")
    image_source = image.get_attribute("src")
    print(f"Downloading: {image_source}")
    filename = download_image(image_source)
    print(f"Saved to: {filename}")
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
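The example above assumes the src attribute holds a complete URL. On other sites src may be relative (for example /img/logo.svg) or carry a query string, in which case the download or the filename would come out wrong. A minimal sketch of how that could be handled with the standard library follows; resolve_image_url and filename_from_url are hypothetical helper names, and the example.edu URLs in the usage note are placeholders.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlsplit


def resolve_image_url(page_url: str, src: str) -> str:
    """Turn a possibly relative src value into an absolute URL.

    If src is already absolute, urljoin returns it unchanged.
    """
    return urljoin(page_url, src)


def filename_from_url(url: str, default: str = "image") -> str:
    """Use the last path segment as the filename, ignoring ?query and #fragment."""
    name = PurePosixPath(urlsplit(url).path).name
    # fall back to a default when the URL path has no final segment
    return name or default
```

For instance, resolve_image_url("https://www.example.edu/about/", "/img/logo.svg") gives "https://www.example.edu/img/logo.svg", which could then be passed to download_image.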

Playwright Codegen

Playwright has a codegen feature that can help you generate code to interact with a webpage.

python -m playwright codegen 

For example, using codegen, we can generate the code needed to search for IST 356 in the course catalog:

# pw-codegen_example.py
import re
from playwright.sync_api import Playwright, sync_playwright, expect


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://coursecatalog.syracuse.edu/course-search/")
    page.get_by_role("textbox", name="Keyword").fill("IST 356")
    page.get_by_role("button", name="SEARCH").click()
    page.get_by_role("link", name="IST 356 Programming").click()
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

Once we have the code necessary to load the page, we can then use the Inspect tool in the open Chromium page to get whatever elements we want. For instance, inspecting the course description shows that it is in <div class="section__content"> which is a sub-element of <div class="section section--description">. We can therefore retrieve the description with:

# pw-ist356description.py
from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://coursecatalog.syracuse.edu/course-search/")
    page.get_by_role("textbox", name="Keyword").click()
    page.get_by_role("textbox", name="Keyword").fill("IST 356")
    page.get_by_role("button", name="SEARCH").click()
    page.get_by_role("link", name="IST 356 Programming").click()
    # ---------------------
    descriptor = page.query_selector("div.section.section--description > div.section__content")
    course_description = descriptor.inner_text()
    print("Course Description:")
    print(course_description)
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

Pausing for elements to load

If you run the above code, you may find that it fails with:

    course_description = descriptor.inner_text()
                         ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'inner_text'

This happens because Python executes the script faster than the website can serve the requested content. The line page.get_by_role("link", name="IST 356 Programming").click() completes as soon as Playwright has performed the click, and Python moves straight on to the next line (the query selector). The page, however, may need longer to update in response to the click, i.e., for the course description to appear. If the description is not yet on the page when the page.query_selector line runs, the query selector returns None.

This is a common problem when working with interactive websites like this. There are a couple of ways to deal with it.

Using sleep to pause execution

The easiest (but kludgy) solution is to simply force your code to pause for a preset amount of time before trying to select anything off the site. You can do that with the sleep command, which you need to import from the time module. Applying to our above example:

# pw-ist356description.py
from playwright.sync_api import Playwright, sync_playwright
from time import sleep


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://coursecatalog.syracuse.edu/course-search/")
    page.get_by_role("textbox", name="Keyword").click()
    page.get_by_role("textbox", name="Keyword").fill("IST 356")
    page.get_by_role("button", name="SEARCH").click()
    page.get_by_role("link", name="IST 356 Programming").click()
    # ---------------------
    # pause for a second to let the page fully load
    sleep(1)
    descriptor = page.query_selector("div.section.section--description > div.section__content")
    course_description = descriptor.inner_text()
    print("Course Description:")
    print(course_description)
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

The sleep(1) line will cause the code to pause for 1 second. That (hopefully) is enough time for the website to finish loading, so that the subsequent lines will complete.

Note

As discussed below, using sleep to wait for an element to load is not a great solution. However, sleep is useful for debugging. As you’ve probably noticed, when you run a script, the browser it opens disappears as soon as the code block is done executing. That can be very quick; too quick for you to see what it did. By sticking a sleep in your code for some longer period of time (say 60 seconds) you can get the browser to persist for a while so you can see what was loaded, and inspect the page. This can help you debug any issues. Just remember to remove the sleep when you’re done debugging!

Better: Waiting for the selector to appear

Using sleep works, but it is not a great solution because you don’t know how long the element you need will take to load. The load time varies with your internet connection and the server’s load at any given moment. If the hardcoded pause is too short, the script still fails; if it is too long, your program wastes time.

A better solution is to use Playwright’s wait_for_selector method to wait for the desired element to appear. Here’s how we can use that in our code above:

# pw-ist356description.py
from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://coursecatalog.syracuse.edu/course-search/")
    page.get_by_role("textbox", name="Keyword").click()
    page.get_by_role("textbox", name="Keyword").fill("IST 356")
    page.get_by_role("button", name="SEARCH").click()
    page.get_by_role("link", name="IST 356 Programming").click()
    # ---------------------
    element = "div.section.section--description > div.section__content"
    # note: the timeout is in milliseconds
    page.wait_for_selector(element, timeout=10000)
    # the element is now on the page, so the query will not return None
    descriptor = page.query_selector(element)
    course_description = descriptor.inner_text()
    print("Course Description:")
    print(course_description)
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

The page.wait_for_selector(element, timeout=10000) call makes the code wait until the desired element appears on the page before continuing to the next line. If the element never appears (say, you have a bug, or the website changed), the call does not wait forever: it raises a timeout error once the timeout expires. Playwright’s default timeout is 30 seconds, so it is good practice to set one explicitly that suits your page. Note that the timeout is in milliseconds; by specifying timeout=10000 we’ve set the timeout to 10 seconds.
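To make the idea of waiting with a timeout concrete, here is a generic polling helper written with only the standard library. It is an illustration of the concept, not how Playwright implements wait_for_selector; the name wait_until and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(check: Callable[[], Optional[T]],
               timeout_ms: float = 10_000,
               poll_ms: float = 100) -> T:
    """Call check() repeatedly until it returns something other than None.

    Raises TimeoutError if nothing is returned within timeout_ms milliseconds.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_ms} ms")
        # pause briefly before checking again
        time.sleep(poll_ms / 1000)
```

Unlike a fixed sleep, this returns as soon as the condition is met and fails loudly when it never is, which is the behavior the timeout argument provides.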

Best: Using a locator

If you click on the documentation page for wait_for_selector you’ll see that using wait_for_selector is discouraged. (There is a discussion on Stack Overflow here about why.)

The best solution is instead to use locators. Locators are a more advanced way to locate elements on a page. The code created by codegen made heavy use of them: all of those page.get_by_role lines are locators. You can do the same thing for the description. In fact, you can get codegen to give you the appropriate line of code by simply clicking on the “Description” header. Doing so in codegen, you’ll see the line page.get_by_role("heading", name="Description").click() appear in the Playwright inspector.

Locators have auto-waiting built in: an action such as click() waits for the element to appear before it runs, so you don’t need any extra lines to make your code wait. Our code therefore looks like:

# pw-ist356description.py
from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://coursecatalog.syracuse.edu/course-search/")
    page.get_by_role("textbox", name="Keyword").fill("IST 356")
    page.get_by_role("button", name="SEARCH").click()
    page.get_by_role("link", name="IST 356 Programming").click()
    page.get_by_role("heading", name="Description").click()
    # ---------------------
    element = "div.section.section--description > div.section__content"
    descriptor = page.query_selector(element)
    course_description = descriptor.inner_text()
    print("Course Description:")
    print(course_description)
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
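
Since locators wait on their own, the description can also be read through a locator directly, without clicking the heading first or calling query_selector. The sketch below assumes page is the Playwright page from the script above; get_course_description is a hypothetical helper name, and the selector is the one found earlier with the Inspect tool.

```python
DESCRIPTION_SELECTOR = "div.section.section--description > div.section__content"


def get_course_description(page, timeout_ms: float = 10_000) -> str:
    """Return the course description text, letting the locator do the waiting.

    `page` is assumed to be a Playwright Page. page.locator() returns
    immediately; the wait (up to timeout_ms) happens inside inner_text().
    """
    # .first picks the first match in case the selector matches several elements
    return page.locator(DESCRIPTION_SELECTOR).first.inner_text(timeout=timeout_ms)
```

In run, this would be called as course_description = get_course_description(page) right after the click on the "IST 356 Programming" link.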

Code Challenge 6.2.2

Use Playwright codegen to extract the SU football schedule for 2023 from https://cuse.com. Output the table as a CSV.

# get_su_football_schedule.py
from io import StringIO
from playwright.sync_api import Playwright, sync_playwright, expect
import pandas as pd

def run(playwright: Playwright, year: int) -> pd.DataFrame:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto(f"https://cuse.com/sports/football/schedule/{year}")
    page.get_by_role("tab", name="Table View not selected").click()
    # clicking on the date column ensures that the table is loaded before we try to parse it
    page.locator("[data-test-id=\"s-table__root\"]").get_by_text("Date").click()
    # According to the Pandas docs, we now need to wrap content in StringIO
    # before passing to read_html
    dfs = pd.read_html(StringIO(page.content()))
    context.close()
    browser.close()
    return dfs[0]


with sync_playwright() as playwright:
    df = run(playwright, year=2023)
    # index=False leaves the DataFrame's row numbers out of the CSV
    print(df.to_csv(index=False))