Web Scraping with BeautifulSoup
Web scraping is the technique of extracting data from websites automatically. While APIs are the preferred way to access data, many sites still do not offer one. BeautifulSoup is Python’s most popular library for parsing HTML and XML documents, making web scraping accessible even to beginners.
This guide covers the essentials: installing the library, parsing HTML, navigating the document tree, and extracting the data you need.
Installing BeautifulSoup
BeautifulSoup works with a parser to convert HTML into a navigable tree. You will need the library itself and a parser:
pip install beautifulsoup4 lxml
The lxml parser is faster and more forgiving than Python’s built-in html.parser. It handles malformed HTML more gracefully.
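Since lxml is a third-party dependency, a script can prefer it and fall back to the built-in parser when it is missing. A minimal sketch (the sample HTML is just for illustration):

```python
from bs4 import BeautifulSoup

# Prefer lxml when installed, fall back to the stdlib parser otherwise.
try:
    import lxml  # noqa: F401
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

soup = BeautifulSoup("<p>Hello</p>", PARSER)
print(soup.p.text)  # Hello
```

Both parsers produce the same tree for well-formed HTML; differences show up mainly on malformed markup.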
Parsing Your First Page
Start by fetching a webpage and parsing it with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "lxml")
print(soup.title) # <title>Example Domain</title>
print(soup.title.text) # Example Domain
The BeautifulSoup object represents the parsed document. You can now search and navigate the tree.
Finding Elements
BeautifulSoup provides several methods to locate elements:
find() and find_all()
The most common methods search by tag name:
# Find the first matching element
first_paragraph = soup.find("p")
print(first_paragraph.text)
# Find all matching elements
all_paragraphs = soup.find_all("p")
for p in all_paragraphs:
    print(p.text)
Searching by Class, ID, and Attributes
Refine your searches with CSS selectors passed as keyword arguments:
# By class (note: class is a Python keyword, use class_=)
content = soup.find("div", class_="content")
# By ID
header = soup.find("header", id="main-header")
# By any attribute
links = soup.find_all("a", href=True) # All <a> tags with href
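Some attribute names, such as `data-*` attributes, are not valid Python keyword arguments. For those, pass an `attrs` dictionary instead. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html = '<div data-id="42" class="card">Widget</div>'
soup = BeautifulSoup(html, "html.parser")

# data-id cannot be written as a keyword argument, so use attrs={}
card = soup.find("div", attrs={"data-id": "42"})
print(card.text)  # Widget
```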
Using CSS Selectors
For complex selections, use the select() method with CSS selectors:
# Select all paragraphs inside .article-content
articles = soup.select(".article-content p")
# Select nested elements
nav_links = soup.select("nav ul li a")
# Select by attribute
email = soup.select_one('a[href^="mailto:"]')
This is often the fastest way to extract data when you are familiar with CSS selectors.
Navigating the Tree
Once you have an element, traverse the tree using properties:
# Parent and siblings
parent = element.parent
next_sibling = element.next_sibling
previous_sibling = element.previous_sibling
# Children
children = element.children
for child in element.children:
    print(child)
Be careful with whitespace-only text nodes: .next_sibling often returns the newline between two tags rather than the next tag. Use .find_next_sibling() to skip text nodes, or iterate an element's .stripped_strings to get just the text.
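The difference is easy to see on a small example (markup invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

# .next_sibling is the whitespace text node between the two <li> tags
print(repr(first.next_sibling))
# .find_next_sibling() skips text nodes and returns the next tag
print(first.find_next_sibling("li").text)  # Second
```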
Extracting Data
Beyond text, extract attributes, links, and other data:
# Get attribute values
link = soup.find("a")
url = link.get("href")  # Using .get() — returns None if the attribute is missing
url = link["href"]      # Direct access — raises KeyError if the attribute is missing
# Get all links from the page
for link in soup.find_all("a"):
    href = link.get("href")
    text = link.get_text(strip=True)
    if href and text:
        print(f"{text}: {href}")
Handling Missing Data
Some elements may not exist. Handle this gracefully:
# Using try/except
try:
    email = soup.find("a", href=True)["href"]
except TypeError:
    email = None
# Or using .get() with default
description = soup.find("meta", attrs={"name": "description"})
desc_text = description.get("content", "") if description else ""
Real-World Example: Scraping a Blog
Here is a practical example that extracts article titles and links from a blog index:
from bs4 import BeautifulSoup
import requests
def scrape_blog(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    articles = []
    for article in soup.select("article.post"):
        title_elem = article.select_one("h2.entry-title")
        link_elem = article.select_one("a")
        if title_elem and link_elem:
            articles.append({
                "title": title_elem.get_text(strip=True),
                "url": link_elem.get("href"),
            })
    return articles
# Usage
posts = scrape_blog("https://example.com/blog")
for post in posts:
    print(f'{post["title"]} -> {post["url"]}')
This pattern—select containers, then extract fields from each—works for most scraping tasks.
Handling Pagination
When data spans multiple pages, loop through pages:
def scrape_all_pages(base_url, max_pages=5):
    all_articles = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "lxml")
        articles = soup.select("article")
        if not articles:
            break
        for article in articles:
            title_elem = article.select_one("h2")
            if title_elem:  # guard against articles without an <h2>
                all_articles.append(title_elem.get_text(strip=True))
    return all_articles
Check the status code on each request, and stop when a page returns no articles; either signal usually means you have reached the last page.
Respectful Scraping
Web scraping puts load on servers. Follow these practices:
- Check robots.txt before scraping: https://example.com/robots.txt
- Add delays between requests: time.sleep(1)
- Set a User-Agent header to identify your script
- Cache results to avoid repeated requests
headers = {
    "User-Agent": "MyWebScraper/1.0 (contact@example.com)"
}
response = requests.get(url, headers=headers)
Many sites block scrapers that do not identify themselves. A descriptive User-Agent reduces the chance of being blocked.
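The practices above can be combined in a small helper. A sketch, assuming a fixed delay and placeholder URLs; the `fetch_politely` name and its parameters are invented for this example:

```python
import time

import requests

# A reusable session carries the User-Agent on every request
session = requests.Session()
session.headers["User-Agent"] = "MyWebScraper/1.0 (contact@example.com)"

def fetch_politely(urls, delay=1.0):
    """Fetch each URL with a fixed delay between requests."""
    pages = []
    for url in urls:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(delay)  # be kind to the server
    return pages
```

Using a Session also reuses the underlying TCP connection, which is both faster and lighter on the server.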
Common Pitfalls
Watch out for these gotchas:
- Dynamic content loaded via JavaScript: BeautifulSoup cannot see it. Use Selenium or Playwright instead.
- Encoding issues: Pass response.content instead of response.text to BeautifulSoup, so the parser can detect the encoding declared in the document itself.
- Relative URLs: Resolve them with urllib.parse.urljoin(base, relative).
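The relative-URL pitfall is worth a concrete example. `urljoin` resolves an href against the URL of the page it came from:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/index.html"
# Relative paths resolve against the page's directory
print(urljoin(base, "post-1.html"))   # https://example.com/blog/post-1.html
# Root-relative paths resolve against the site root
print(urljoin(base, "/about"))        # https://example.com/about
# Absolute URLs pass through unchanged
print(urljoin(base, "https://other.example/x"))
```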
See Also
- requests-library — Making HTTP requests for fetching pages
- regex-guide — Cleaning extracted text with regular expressions
- json-module — Parsing JSON data from APIs
Extracting Table Data
Tables are common on websites and contain structured data. Extracting table data into a list of dictionaries is a frequent scraping task:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://example.com/table-page")
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.select("thead th")]
rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append(dict(zip(headers, cells)))
for row in rows:
    print(row)
This pattern converts HTML tables into structured data you can export to CSV or process further.
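Since the rows are already dictionaries, exporting them to CSV is a short step with the standard library's csv module. A sketch with made-up sample rows in the same shape as the scraping loop produces:

```python
import csv
import io

# Hypothetical rows, in the shape produced by the table-scraping loop
rows = [
    {"Name": "Ada", "Language": "Python"},
    {"Name": "Linus", "Language": "C"},
]

# Write to an in-memory buffer; open a file instead for real exports
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```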