Damilare Jolayemi Damilare is an enthusiastic problem-solver who enjoys building whatever works on the computer. He has a knack for slapping his keyboards till something works. When he's not talking to his laptop, you'll find him hopping on road trips and sharing moments with his friends, or watching shows on Netflix.

Build a Python web scraper with Beautiful Soup

If you spend some time in the technology space, you’ll probably come across the terms “web scraping” and “web scrapers”. But do you know what they are, how they work, or how to build one for yourself?

If your answer to any of those questions is no, read on as we’ll be covering everything about web scraping in this article. You will also get a chance to build one using Python and the Beautiful Soup library.

What is web scraping?

Web scraping refers to extracting and harvesting data from websites via the Hypertext Transfer Protocol (HTTP) in an automated fashion, using a script or program known as a web scraper.

A web scraper is a software application capable of accessing resources on the internet and extracting required information. Often, web scrapers can structure and organize the collected data and store it locally for future use.

Some standard web scraping tools include:

  • Beautiful Soup
  • Scrapy
  • Selenium
  • Puppeteer

You might be wondering why anybody might be interested in using a web scraper. Here are some common use cases:

  • Generating leads for marketing purposes
  • Monitoring and comparing prices of products in multiple stores
  • Data analysis and academic research
  • Gathering data for training machine learning models
  • Analyzing social media profiles
  • Information gathering and cybersecurity
  • Fetching financial data (stocks, cryptocurrency, forex rates, etc.)

Challenges faced in web scraping

Web scraping sounds like it’d be a go-to solution when you need data, but it’s not always easy to set up for multiple reasons. Let’s look at some of them.

1. Every website has a different structure

Websites are built by different teams using different tools, designs, and layouts, so any given website can be structured very differently from the next. This implies that if you create a web scraper for one website, you'd have to build a separate version to be fully compatible with another — except when the two share very similar content, or your web scraper uses clever heuristics.

2. Websites frequently change their designs and structures

The durability of a web scraper is a significant problem. A scraper that works perfectly today can break seemingly overnight because the website you're extracting data from updated its design and structure. You'll therefore have to make frequent changes to your scraper logic to keep it running.
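One way to soften this problem is to make the scraper fail loudly when the page no longer matches its assumptions, rather than silently extracting wrong or empty data. Below is a minimal sketch of that habit; the HTML snippet and the price class name are hypothetical stand-ins for a real page:

```python
from bs4 import BeautifulSoup

# hypothetical snapshot of the page the scraper was written against
html = "<html><body><span class='price'>$10</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

# fail loudly if the expected element is gone, instead of silently
# carrying on after a site redesign
price_tag = soup.find("span", {"class": "price"})
if price_tag is None:
    raise RuntimeError("page layout changed: expected a span.price element")
print(price_tag.text)  # $10
```

An explicit error like this makes a broken scraper obvious in your logs the day the site changes, rather than weeks later when someone notices the data is wrong.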

3. Some websites implement bot prevention measures

Over the years, people began abusing web scrapers to perform malicious activities. Web developers responded by implementing measures that prevent their data from being scraped. Some of these measures include:


  • Adding CAPTCHA when submitting forms
  • Using Cloudflare to authorize visitors
  • Validating user agents of visitors
  • Rejecting proxy requests
  • Throttling web resources
  • IP address safelisting/blocklisting

4. Rate limiting techniques can disturb scraping

In short, rate limiting is a technique that controls how much traffic a system processes by setting usage caps on its operations. In this context, the operation in question is serving website content to visitors.

Rate limiting becomes troublesome when you're trying to scrape a lot of data from multiple pages of a website.
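A common courtesy (and a way to avoid getting blocked) is to throttle your own requests so you stay under the site's limits. Below is a minimal client-side throttle sketch; the interval values are arbitrary examples, and in a real scraper each `requests.get(...)` call would follow the `wait()` call:

```python
import time

def throttle(min_interval):
    """Return a callable that blocks until at least min_interval seconds
    have passed since its previous call -- a simple client-side rate limit."""
    last_call = [0.0]

    def wait():
        elapsed = time.monotonic() - last_call[0]
        if elapsed < min_interval:
            # sleep off the remainder of the interval before proceeding
            time.sleep(min_interval - elapsed)
        last_call[0] = time.monotonic()

    return wait

# pause at least half a second between page fetches
wait = throttle(0.5)
for page in range(3):
    wait()  # in a real scraper, requests.get(...) would go right after this
```

More elaborate strategies (exponential backoff on 429 responses, respecting a `Retry-After` header) build on the same idea of spacing requests out over time.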

5. Dynamic websites are harder to scrape

A dynamic website uses scripts to generate its content in the browser. Often, it fetches data from an external source and fills the page with it at runtime.

If your web scraper makes a GET request to the webpage and parses the returned data, it will not function as expected, because it does not execute the site's scripts. The solution is to use a tool like Selenium that spins up a browser instance and executes the required scripts.
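We can demonstrate the problem locally: Beautiful Soup only sees the raw HTML, so content that a script would inject at runtime is simply missing. The contrived page below (the script and its price string are made up for illustration) shows the element a plain GET request would actually receive:

```python
from bs4 import BeautifulSoup

# a simplified "dynamic" page: the visible data would only be inserted
# by JavaScript when a real browser runs the script tag below
html = """
<html><body>
  <div id="prices"></div>
  <script>
    document.getElementById('prices').innerText = 'BTC: $43,000';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# a plain GET request never executes the script, so the div is still empty
print(repr(soup.find("div", id="prices").text))  # ''
```

A browser-driving tool like Selenium sidesteps this by returning the page source after the scripts have run.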

Basic concepts

Before we get into our in-depth example, let’s make sure we’ve set up properly and understand a few basic concepts about web scraping in practice.

To follow and understand this tutorial, you will need the following:

  • Working knowledge of HTML and Python
  • Python 3.6 or later installed on your machine
  • A Python development environment (e.g., text editor, IDE)
  • Beautiful Soup ≥4.0

First, install Beautiful Soup, a Python library that provides simple methods for you to extract data from HTML and XML documents.

In your terminal, type the following:

pip install beautifulsoup4

Parse an HTML document using Beautiful Soup

Let’s explore a block of Python code that uses Beautiful Soup to parse and navigate an HTML document:

from bs4 import BeautifulSoup

# define an HTML document
html = "<!DOCTYPE html><html><head><title>This is the title of a website</title></head><body><h1 id='heading-1'>This is the main heading</h1><h2 id='heading-2'>This is a subheading</h2><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p><ul><li class='list-item'>First</li><li class='list-item'>Second</li><li class='list-item'>Third</li></ul></body></html>"

# parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

# print the HTML in a beautiful form
print(soup.prettify())

We imported the Beautiful Soup library into a script and created a BeautifulSoup object from our HTML document in the code above. Then, we used the prettify() method to display the HTML content in an adequately indented form. Below is the output:
The Beautiful Soup object we created in HTML form

Extract HTML elements by their tag names

Next, let’s extract some of the HTML tags in our document. Beautiful Soup provides a couple of methods that allow you to extract elements.

Let’s look at an example:

# getting the title element of the HTML
print(soup.title)

# getting the first h1 element in the HTML
print(soup.h1)

And its output:
Extract the HTML elements by tag names

Beautiful Soup provides a find() method that allows you to extract elements with specific criteria. Let’s see how to use it:

# getting the first h2 element in the HTML
print(soup.find("h2"))

# getting the first p element in the HTML
print(soup.find("p"))

And what the output looks like:
The output for extracting HTML elements with specific criteria

Beautiful Soup also provides a find_all() method to extract every element with a specific tag as a list, instead of getting only the first occurrence. Let’s see its usage:

# getting all the li elements in the HTML
print(soup.find_all("li"))

The output when you extract elements by specific tags as a list
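Because find_all() returns a list of elements, you can loop over the results or use a list comprehension to pull out just the text. Reusing a list like the one in our sample document:

```python
from bs4 import BeautifulSoup

html = "<ul><li class='list-item'>First</li><li class='list-item'>Second</li><li class='list-item'>Third</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of elements, so it supports ordinary iteration
items = [li.text for li in soup.find_all("li")]
print(items)  # ['First', 'Second', 'Third']
```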

Extract HTML elements by their IDs

You might want to extract HTML elements that have a specific ID attached to them. The find() method allows you to supply an ID to filter its search results.

Let’s see how to use it:

# getting the h1 element with the heading-1 id
print(soup.find("h1", id="heading-1"))

# getting the h2 element with the heading-2 id
print(soup.find("h2", {"id": "heading-2"}))

And below is the output:
The output when extracting HTML elements by their IDs

Extract HTML elements with their class

Beautiful Soup also lets you extract HTML elements with a specific class by supplying the find() and find_all() methods with appropriate parameters to filter their search results. Let’s see its usage:

# getting the first li element with the list-item class
print(soup.find("li", {"class": "list-item"}))

# getting all the li elements with the list-item class
print(soup.find_all("li", {"class": "list-item"}))

Extracting HTML elements by their class
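If you prefer CSS selectors, Beautiful Soup also exposes select() and select_one(), which match elements using the same selector syntax you'd use in a stylesheet. For example, with a similar list:

```python
from bs4 import BeautifulSoup

html = "<ul><li class='list-item'>First</li><li class='list-item'>Second</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: tag name plus .class, just like a stylesheet
matches = soup.select("li.list-item")
print(len(matches))  # 2

# select_one() returns only the first match, similar to find()
print(soup.select_one("li.list-item").text)  # First
```

This is handy when you already know the selector from inspecting the page in your browser's developer tools.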

Access an element’s attributes and content

You might want to retrieve the values of the attributes and content of the elements you extract.

Luckily, Beautiful Soup provides functionalities for achieving this. Let’s see some examples:

# define an HTML document
html = "<a id='homepage' class='hyperlink' href='https://google.com'>Google</a>"

# parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

# extract the a element in the HTML
element = soup.find("a")

# extract the element id
print("ID:", element["id"])

# extract the element class
print("class:", element["class"])

# extract the element href
print("href:", element["href"])

# extract the text contained in the element
print("text:", element.text)
print("text:", element.get_text())

Access the element's attributes and contents
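One caveat: indexing an element like a dictionary raises a KeyError when the attribute is missing. The element's get() method is a safer alternative when an attribute may be absent:

```python
from bs4 import BeautifulSoup

html = "<a id='homepage' href='https://google.com'>Google</a>"
soup = BeautifulSoup(html, "html.parser")
element = soup.find("a")

# bracket access raises a KeyError when the attribute does not exist:
# element["target"]  # KeyError: 'target'

# .get() returns None, or a default you supply, instead of raising
print(element.get("target"))           # None
print(element.get("target", "_self"))  # _self
print(element.get("href"))             # https://google.com
```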

Let’s build a web scraper

Now that we have covered the basics of web scraping with Python and Beautiful Soup, let’s build a script that scrapes and displays cryptocurrency information from CoinGecko.

Step 1: Install dependencies

You need to install the Requests library for Python, which lets your scripts send HTTP/1.1 requests with minimal effort.

In your terminal, type the following:

pip install requests

Step 2: Fetch CoinGecko HTML data

Now, we’ll retrieve CoinGecko’s HTML content to parse and extract the required information with Beautiful Soup. Create a file named scraper.py and save the code below in it:

import requests


def fetch_coingecko_html():
    # make a request to the target website
    r = requests.get("https://www.coingecko.com")
    if r.status_code == 200:
        # if the request is successful return the HTML content
        return r.text
    else:
        # throw an exception if an error occurred
        raise Exception("an error occurred while fetching coingecko html")

Step 3: Study the CoinGecko website structure

Remember: we highlighted earlier that every website has a different structure, so we need to study how CoinGecko is structured and built before writing our scraper.

Open https://coingecko.com in your browser so we have a view of the website we are scraping (the below screenshot is from my Firefox browser):
The CoinGecko website in a Firefox browser

Since we want to scrape cryptocurrency information, open the Inspector tab in the Web Developer Toolbox and view the source code of any cryptocurrency element from the information table:
Bitcoin's price according to CoinGecko

The source code of the Bitcoin element

From the source code above, we can notice the following things about the HTML tags we’re inspecting:

  • Every cryptocurrency element is stored in a tr tag contained in a div tag with coin-table class
  • The cryptocurrency name is stored in a td tag with coin-name class
  • The price is stored in a td tag with td-price and price classes
  • The price changes are stored in a td tag with td-change1h, td-change24h, and td-change7d classes
  • The trading volume and market cap are stored in a td tag with td-liquidity_score and td-market_cap classes

Step 4: Extract the data with Beautiful Soup

Now that we have studied the structure of CoinGecko’s website, let’s use Beautiful Soup to extract the data we need.

Add a new function to the scraper.py file:

from bs4 import BeautifulSoup

def extract_crypto_info(html):
    # parse the HTML content with Beautiful Soup
    soup = BeautifulSoup(html, "html.parser")

    # find all the cryptocurrency elements
    coin_table = soup.find("div", {"class": "coin-table"})
    crypto_elements = coin_table.find_all("tr")[1:]

    # iterate through our cryptocurrency elements
    cryptos = []
    for crypto in crypto_elements:
        # extract the information needed using our observations
        cryptos.append({
            "name": crypto.find("td", {"class": "coin-name"})["data-sort"],
            "price": crypto.find("td", {"class": "td-price"}).text.strip(),
            "change_1h": crypto.find("td", {"class": "td-change1h"}).text.strip(),
            "change_24h": crypto.find("td", {"class": "td-change24h"}).text.strip(),
            "change_7d": crypto.find("td", {"class": "td-change7d"}).text.strip(),
            "volume": crypto.find("td", {"class": "td-liquidity_score"}).text.strip(),
            "market_cap": crypto.find("td", {"class": "td-market_cap"}).text.strip()
        })

    return cryptos

Here, we created an extract_crypto_info() function that extracts all the cryptocurrency information from CoinGecko’s HTML content. We used the find(), find_all(), and .text methods from Beautiful Soup to navigate CoinGecko’s data and extract what we needed.

Step 5: Display the extracted data

Let’s use the function we created above to complete our scraper and display cryptocurrency information in the terminal. Add the following code to the scraper.py file:

# fetch CoinGecko's HTML content
html = fetch_coingecko_html()

# extract our data from the HTML document
cryptos = extract_crypto_info(html)

# display the scraper results
for crypto in cryptos:
    print(crypto, "\n")

Once you run that, you’ll see the following:
The display of the extracted data

You can also decide to save the results in a JSON file locally:

import json

# save the results locally in JSON
with open("coingecko.json", "w") as f:
    f.write(json.dumps(cryptos, indent=2))

The same extracted data displayed in JSON format

Conclusion

In this article, you learned about web scraping and web scrapers, their uses, the challenges associated with web scraping, and how to use the Beautiful Soup library. We also explored multiple implementation code snippets and built a web scraper to retrieve cryptocurrency information from CoinGecko with Python and Beautiful Soup.

The source code of the cryptocurrency web scraper is available as a GitHub Gist. You can head over to the official Beautiful Soup documentation to explore more functionalities it provides and build amazing things with the knowledge acquired from this tutorial.
