Web Scraping with BeautifulSoup

· 4 min read · Updated March 14, 2026 · beginner
python web scraping beautifulsoup html

Web scraping is the technique of extracting data from websites automatically. While APIs are the preferred way to access data, many sites still do not offer one. BeautifulSoup is Python’s most popular library for parsing HTML and XML documents, making web scraping accessible even to beginners.

This guide covers the essentials: installing the library, parsing HTML, navigating the document tree, and extracting the data you need.

Installing BeautifulSoup

BeautifulSoup works with a parser to convert HTML into a navigable tree. You will need the library itself and a parser:

pip install beautifulsoup4 lxml

The lxml parser is faster than Python’s built-in html.parser and copes well with malformed HTML, repairing missing or unclosed tags as it builds the tree.

Parsing Your First Page

Start by fetching a webpage and parsing it with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "lxml")

print(soup.title)       # <title>Example Domain</title>
print(soup.title.text)  # Example Domain

The BeautifulSoup object represents the parsed document. You can now search and navigate the tree.

Finding Elements

BeautifulSoup provides several methods to locate elements:

find() and find_all()

The most common methods search by tag name:

# Find the first matching element
first_paragraph = soup.find("p")
print(first_paragraph.text)

# Find all matching elements
all_paragraphs = soup.find_all("p")
for p in all_paragraphs:
    print(p.text)

Searching by Class, ID, and Attributes

Refine your searches by passing attributes as keyword arguments:

# By class (note: class is a Python keyword, use class_=)
content = soup.find("div", class_="content")

# By ID
header = soup.find("header", id="main-header")

# By any attribute
links = soup.find_all("a", href=True)  # All <a> tags with href

Using CSS Selectors

For complex selections, use the select() method with CSS selectors:

# Select all paragraphs inside .article-content
articles = soup.select(".article-content p")

# Select nested elements
nav_links = soup.select("nav ul li a")

# Select by attribute
email = soup.select_one('a[href^="mailto:"]')

This is often the fastest way to extract data when you are familiar with CSS selectors.
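To see how the two APIs line up, here is a small self-contained comparison (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="content"><p id="intro">Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find()/select_one() can express the same query and return the same element
assert soup.find("div", class_="content") is soup.select_one("div.content")
assert soup.find("p", id="intro") is soup.select_one("p#intro")
```

Use whichever reads more naturally; select() shines when the query involves nesting or combinators.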

Navigating the Tree

Once you have an element, traverse the tree using properties:

# Parent and siblings
parent = element.parent
next_sibling = element.next_sibling
previous_sibling = element.previous_sibling

# Children
children = element.children
for child in element.children:
    print(child)

Be careful with whitespace-only text nodes: the newlines between tags count as siblings, so .next_sibling often returns a string like "\n" rather than the next tag. Use find_next_sibling() and find_previous_sibling() to skip them, or iterate an element’s .stripped_strings to get only non-empty text.
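A short demonstration of the whitespace-sibling pitfall, on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<div><p>First</p>\n<p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p")

print(repr(first.next_sibling))           # '\n' -- a whitespace text node, not a tag
print(first.find_next_sibling("p").text)  # Second
```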

Extracting Data

Beyond text, extract attributes, links, and other data:

# Get attribute values
link = soup.find("a")
url = link.get("href")          # Using .get()
url = link["href"]              # Direct access

# Get all links from the page
for link in soup.find_all("a"):
    href = link.get("href")
    text = link.get_text(strip=True)
    if href and text:
        print(f"{text}: {href}")

Handling Missing Data

Some elements may not exist. Handle this gracefully:

# Using try/except
try:
    email = soup.find("a", href=True)["href"]
except TypeError:
    email = None

# Or using .get() with default
description = soup.find("meta", attrs={"name": "description"})
desc_text = description.get("content", "") if description else ""

Real-World Example: Scraping a Blog

Here is a practical example that extracts article titles and links from a blog index:

from bs4 import BeautifulSoup
import requests

def scrape_blog(url):
    response = requests.get(url)
    response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "lxml")
    articles = []
    
    for article in soup.select("article.post"):
        title_elem = article.select_one("h2.entry-title")
        link_elem = article.select_one("a")
        
        if title_elem and link_elem:
            articles.append({
                "title": title_elem.get_text(strip=True),
                "url": link_elem.get("href"),
            })
    
    return articles

# Usage
posts = scrape_blog("https://example.com/blog")
for post in posts:
    print(f'{post["title"]} -> {post["url"]}')

This pattern of selecting containers, then extracting fields from each, works for most scraping tasks.

Handling Pagination

When data spans multiple pages, loop through pages:

def scrape_all_pages(base_url, max_pages=5):
    all_articles = []
    
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url)
        
        if response.status_code != 200:
            break
            
        soup = BeautifulSoup(response.text, "lxml")
        articles = soup.select("article")
        
        if not articles:
            break
            
        for article in articles:
            title_elem = article.select_one("h2")
            if title_elem:  # skip articles without an <h2>
                all_articles.append(title_elem.get_text(strip=True))
    
    return all_articles

A non-200 status code or a page with no articles signals that you have reached the last page.

Respectful Scraping

Web scraping puts load on servers. Follow these practices:

  • Check robots.txt before scraping: https://example.com/robots.txt
  • Add delays between requests: time.sleep(1)
  • Set a User-Agent header to identify your script
  • Cache results to avoid repeated requests

For example:

headers = {
    "User-Agent": "MyWebScraper/1.0 (contact@example.com)"
}
response = requests.get(url, headers=headers)

Many sites block scrapers that do not identify themselves. A descriptive User-Agent reduces the chance of being blocked.
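Checking robots.txt can also be automated with the standard library’s urllib.robotparser. Here the rules are parsed from a made-up string rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed locally for illustration
robots_txt = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyWebScraper", "https://example.com/blog"))     # True
print(rp.can_fetch("MyWebScraper", "https://example.com/admin/x"))  # False
```

In practice you would point set_url() at the site’s real robots.txt and call read() before scraping.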

Common Pitfalls

Watch out for these gotchas:

  • Dynamic content loaded via JavaScript: BeautifulSoup cannot see it. Use Selenium or Playwright instead.
  • Encoding issues: Pass response.content (raw bytes) to BeautifulSoup so it can detect the document’s encoding itself, rather than relying on the guess requests made for response.text.
  • Relative URLs: Resolve them with urllib.parse.urljoin(base, relative).
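The relative-URL fix takes one standard-library call; the base URL below is just an example:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/"
print(urljoin(base, "post-1"))   # https://example.com/blog/post-1
print(urljoin(base, "/about"))   # https://example.com/about
```

Absolute URLs pass through urljoin() unchanged, so it is safe to apply to every href you scrape.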

Extracting Table Data

Tables are common on websites and contain structured data. Extracting table data into a list of dictionaries is a frequent scraping task:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/table-page")
soup = BeautifulSoup(response.text, "lxml")

table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.select("thead th")]

rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append(dict(zip(headers, cells)))

for row in rows:
    print(row)

This pattern converts HTML tables into structured data you can export to CSV or process further.
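Once rows are in that list-of-dicts shape, exporting to CSV takes only the standard library. The sample data here stands in for scraped rows:

```python
import csv
import io

# Sample rows shaped like the table-scraping output above
rows = [
    {"Name": "Ada", "Year": "1815"},
    {"Name": "Alan", "Year": "1912"},
]

buf = io.StringIO()  # swap in open("out.csv", "w", newline="") to write a file
writer = csv.DictWriter(buf, fieldnames=["Name", "Year"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```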