Shahin Rostami Dr. Shahin Rostami is a data scientist with software engineering skills that have been honed over two decades. He has experience in both industry and academia, where he's demonstrated innovation and leadership.

Web scraping in Python with lxml and pandas

6 min read 1916

Python Logo Over a Grassy Background

Data science and visualization tutorials often begin by referencing an existing dataset. However, it’s often the case that we need or want to create our own dataset. So let’s take a few steps back and think about how we can create one using Python and a few of its popular packages!

import requests
import lxml.html
import pandas as pd

Let’s begin with a quick tour of the packages themselves:

  • Requests, a simple HTTP library, and one of the most downloaded Python packages in existence
  • lxml, a feature-rich library for processing XML and HTML
  • pandas, a powerful data manipulation library with useful structures

None of these packages are esoteric, difficult to use, or difficult to get access to. It’s safe to say that learning about them is a worthwhile investment.

So, let’s bring these tools together and automate the creation of a movie dataset based on the current IMDb Top 1000.

Sourcing the data for Python

Here’s the data from the IMDb Top 1000 list.

By default, we’re presented with 50 movies per page that have been ordered by (ascending) popularity.

With some modifications to our request, we can change it to 200 movies per page that have been ordered by (descending) user rating.

If we visit the page and inspect the source of the movie items, we can see they each appear within div elements with the classes lister-item and mode-advanced.

Div Elements of Shawshank Redemption

The XPath to locate these nodes is //div[contains(@class, 'lister-item mode-advanced')].

We made a custom demo for .
No really. Click here to check it out.

Using our web inspector, let’s get the XPath for movie features of interest. The idea here is to find points of reference for navigating the HTML document that allow us to extract the data we need. Relative to each movies node above, these are:

  • URL: .//h3[@class="lister-item-header"]//a/@href
  • Name: .//h3[@class="lister-item-header"]//a/text()
  • Thumbnail: .//div[@class="lister-item-image float-left"]//a//img/@loadlate
  • Rating: .//div[@class="inline-block ratings-imdb-rating"]//strong/text()
  • Genre: .//span[@class="genre"]//text()
  • Gross: .//p[@class="sort-num_votes-visible"]//span[last()]/text()

Extracting the data for a single movie

Let’s put our preparation into practice and get the data for the second-place movie, The Shawshank Redemption, into a few variables.

First, we’ll use the requests package to retrieve the HTML source for the first page of 200 movies.

url = "https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=200"
response = requests.get(url)
content = response.content

With that, we’ve retrieved a response to our GET request and stored the content in our content variable. We could check if our response was successful by looking at the response code, where a response code of 200 means everything went OK.

print(response.status_code)
200

Looking good! Next, we’ll parse our HTML content with lxml so that we can start processing it.

html = lxml.html.fromstring(content)

We can start using our XPath expressions from earlier to select nodes. Let’s select every parent div element that we know stores the data for our movies.

items = html.xpath("//div[contains(@class, 'lister-item mode-advanced')]")

We should expect 200 movies per page, so let’s double-check if that corresponds to the length of the list we just selected:

print(len(items))
200

Great! Now let’s select the div that contains all of the data for The Shawshank Redemption. We know it placed second in the IMDb Top 1000, so we’ll try indexing the second item.

item = items[1]

Finally, let’s extract the data for every feature we prepared for earlier.

name = item.xpath('.//h3[@class="lister-item-header"]//a/text()')[0]
thumbnail = item.xpath('.//div[@class="lister-item-image float-left"]//a//img/@loadlate')[0]
rating = item.xpath('.//div[@class="inline-block ratings-imdb-rating"]//strong/text()')[0]
genre = item.xpath('.//span[@class="genre"]//text()')[0].strip()
gross = item.xpath('.//p[@class="sort-num_votes-visible"]//span[last()]/text()')[0].strip()
gross = gross if "$" in gross else "N/A"
url = "https://www.imdb.com" + item.xpath('.//h3[@class="lister-item-header"]//a/@href')[0]

To clean the data, some additional wrangling for the features is required:

  • url: the paths were relative, so they have been prefixed with the protocol and domain name, https://www.imdb.com
  • genre and gross had unwanted whitespace before/after the desired string, so these were stripped out with Python’s .strip()
  • gross was not always present, but when it was, it was always in dollars. So, we’ll check for the presence of the $ character and list the gross as not available if it’s missing

Let’s print all these out to see how we did!

print(f"{name=}")
print(f"{thumbnail=}")
print(f"{rating=}")
print(f"{genre=}")
print(f"{gross=}")
print(f"{url=}")

name='The Shawshank Redemption'
thumbnail='https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEt[email protected]._V1_UX67_CR0,0,67,98_AL_.jpg'
rating='9.3'
genre='Drama'
gross='$28.34M'
url='https://www.imdb.com/title/tt0111161/'

Perfect! We’ve also taken this opportunity to highlight a useful feature of Python f-strings, which allows us to print the expression, the equals sign, and then the evaluated expression!

Automation and creating our DataFrame

We’ve done everything once for a single movie, so let’s move on to automating things for all 1000 of them.

When visiting the IMDB Top 1000 pages above and clicking the Next button, we are taken to this page.

We can see the key difference here is the inclusion of a start parameter, which is set to start at 201. This will be the key to automating the retrieval of all 1000 movies, so we will parameterize this in our upcoming loop.

items = []
for start in range(1,1000,200):
    url = f"https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=200&start={start}&ref_=adv_nxt"
    response = requests.get(url)
    content = response.content
    html = lxml.html.fromstring(content)
    items += html.xpath("//div[contains(@class, 'lister-item mode-advanced')]") 

Here, we’ve updated the same code from earlier by placing it in a loop that goes from 1 to 1000 in intervals of 200. This value is fed into the start parameter that we’ve just discovered, and we’ve appended the movies to our items list 200 at a time. Let’s check the length of the list to be sure.

print(len(items))
1000

We’re nearly there! Now for our DataFrame. Let’s start by creating an empty DataFrame with our desired columns.

data = pd.DataFrame(columns=['name', 'thumbnail', 'rating', 'genre', 'gross', 'thumbnail', 'url'])

All that’s left now is to loop through our retrieved movie items and append our data to our DataFrame.

for item in items:
    name = item.xpath('.//h3[@class="lister-item-header"]//a/text()')[0]
    thumbnail = item.xpath('.//div[@class="lister-item-image float-left"]//a//img/@loadlate')[0]
    rating = item.xpath('.//div[@class="inline-block ratings-imdb-rating"]//strong/text()')[0]
    genre = item.xpath('.//span[@class="genre"]//text()')[0].strip()
    gross = item.xpath('.//p[@class="sort-num_votes-visible"]//span[last()]/text()')[0].strip()
    gross = gross if "$" in gross else "N/A"
    url = "https://www.imdb.com" + item.xpath('.//h3[@class="lister-item-header"]//a/@href')[0]


    data = data.append({'name': name, 'thumbnail': thumbnail, 'rating': rating, 'genre': genre, 'gross': gross, 'url': url}, ignore_index=True)

We can see a sample of our newly generated dataset by displaying the first 10 rows.

data.head(10)
“`

name

thumbnail

rating

genre

gross

thumbnail

url

0 Jai Bhim https://m.media-amazon.com/images/M/MV5BY2Y5ZW… 9.5 Crime, Drama N/A https://m.media-amazon.com/images/M/MV5BY2Y5ZW… https://www.imdb.com/title/tt15097216/
1 The Shawshank Redemption https://m.media-amazon.com/images/M/MV5BMDFkYT… 9.3 Drama $28.34M https://m.media-amazon.com/images/M/MV5BMDFkYT… https://www.imdb.com/title/tt0111161/
2 The Godfather https://m.media-amazon.com/images/M/MV5BM2MyNj… 9.2 Crime, Drama $134.97M https://m.media-amazon.com/images/M/MV5BM2MyNj… https://www.imdb.com/title/tt0068646/
3 Soorarai Pottru https://m.media-amazon.com/images/M/MV5BOGVjYm… 9.1 Drama N/A https://m.media-amazon.com/images/M/MV5BOGVjYm… https://www.imdb.com/title/tt10189514/
4 The Dark Knight https://m.media-amazon.com/images/M/MV5BMTMxNT… 9.0 Action, Crime, Drama $534.86M https://m.media-amazon.com/images/M/MV5BMTMxNT… https://www.imdb.com/title/tt0468569/
5 The Godfather: Part II https://m.media-amazon.com/images/M/MV5BMWMwMG… 9.0 Crime, Drama $57.30M https://m.media-amazon.com/images/M/MV5BMWMwMG… https://www.imdb.com/title/tt0071562/
6 12 Angry Men https://m.media-amazon.com/images/M/MV5BMWU4N2… 9.0 Crime, Drama $4.36M https://m.media-amazon.com/images/M/MV5BMWU4N2… https://www.imdb.com/title/tt0050083/
7 Sardar Udham https://m.media-amazon.com/images/M/MV5BZGFhNT… 8.9 Biography, Crime, Drama N/A https://m.media-amazon.com/images/M/MV5BZGFhNT… https://www.imdb.com/title/tt10280296/
8 The Lord of the Rings: The Return of the King https://m.media-amazon.com/images/M/MV5BNzA5ZD… 8.9 Action, Adventure, Drama $377.85M https://m.media-amazon.com/images/M/MV5BNzA5ZD… https://www.imdb.com/title/tt0167260/
9 Pulp Fiction https://m.media-amazon.com/images/M/MV5BNGNhMD… 8.9 Crime, Drama $107.93M https://m.media-amazon.com/images/M/MV5BNGNhMD… https://www.imdb.com/title/tt0110912/

All done! It may be useful to save this to a CSV file at this point.

data.to_csv('data.csv')

Analyzing our data

Let’s interrogate our data a little bit. We’ll start with a histogram of the ratings.

data.rating.hist()
<AxesSubplot:>

Ratings Histogram

We could also check out some summary statistics.

data.rating.astype(float).describe()

count    1000.000000
mean        7.968300
std         0.280292
min         7.600000
25%         7.700000
50%         7.900000
75%         8.100000
max         9.500000
Name: rating, dtype: float64

Bonus exercise!

As a bonus, we have some interesting data in the Genre column, which we could use to build a co-occurrence matrix.

We’ll use the itertools package to give us some extra functions for dealing with iterators.

import itertools

Now, let’s wrangle our genres! First, split the genres into lists.

data['genre'] = data['genre'].str.split(",")

Then, clean the data of any whitespace.

for index, row in data.iterrows():
    genre = [x.strip(' ') for x in row.genre]
    row.genre = genre

We’ll build a list of each unique genre and sort them alphabetically:

genres = [st for row in data.genre for st in row]
genres = set(genres)
genres = sorted(genres)

Then, we construct our co-occurrence matrix.

matrix = pd.DataFrame(0, columns=genres, index=genres)

for index, row in data.iterrows():
    if len(row.genre) == 1:
        matrix[row.genre[0]][row.genre[0]] += 1
    else:
        for genre in list(itertools.combinations(row.genre, 2)):
            matrix[genre[0]][genre[1]] += 1
            matrix[genre[1]][genre[0]] += 1

Which, if we’re curious, looks like the following.

matrix
“`

Action

Adventure

Animation

Biography

Comedy

Crime

Drama

Family

Fantasy

Film-Noir

Horror

Music

Musical

Mystery

Romance

Sci-Fi

Sport

Thriller

War

Western

Action 0 85 19 10 22 51 84 1 8 0 3 0 0 9 3 30 2 24 7 3
Adventure 85 0 55 9 56 4 59 15 21 1 2 0 0 5 3 30 0 10 6 7
Animation 19 55 0 3 34 4 23 15 10 0 0 0 0 1 1 2 0 0 1 0
Biography 10 9 3 0 9 20 100 4 0 0 0 7 0 0 1 0 8 3 1 0
Comedy 22 56 34 9 13 31 121 11 13 1 4 8 4 3 48 4 1 4 8 1
Crime 51 4 4 20 31 0 157 0 2 10 2 1 1 35 5 1 1 42 0 0
Drama 84 59 23 100 121 157 88 24 31 15 15 33 11 66 109 29 18 78 45 13
Family 1 15 15 4 11 0 24 0 14 0 0 0 4 0 0 1 2 0 0 0
Fantasy 8 21 10 0 13 2 31 14 0 0 2 2 1 4 8 1 0 0 1 0
Film-Noir 0 1 0 0 1 10 15 0 0 0 0 0 0 5 2 0 0 3 0 0
History 3 5 0 25 0 2 49 0 0 0 0 1 0 1 2 0 0 4 6 0
Horror 3 2 0 0 4 2 15 0 2 0 2 0 0 7 0 7 0 13 0 0
Music 0 0 0 7 8 1 33 0 2 0 0 0 3 1 9 0 0 0 0 0
Musical 0 0 0 0 4 1 11 4 1 0 0 3 0 0 2 0 1 0 0 0
Mystery 9 5 1 0 3 35 66 0 4 5 7 1 0 0 9 8 0 31 2 0
Romance 3 3 1 1 48 5 109 0 8 2 0 9 2 9 0 2 0 3 5 0
Sci-Fi 30 30 2 0 4 1 29 1 1 0 7 0 0 8 2 0 0 5 0 0
Sport 2 0 0 8 1 1 18 2 0 0 0 0 1 0 0 0 0 0 0 0
Thriller 24 10 0 3 4 42 78 0 0 3 13 0 0 31 3 5 0 1 3 1
War 7 6 1 1 8 0 45 0 1 0 0 0 0 2 5 0 0 3 0 1
Western 3 7 0 0 1 0 13 0 0 0 0 0 0 0 0 0 0 1 1 4

Notice the size: 21 rows × 21 columns

Finally, we will use Plotapi to create an interactive Chord diagram with our data!

from plotapi import Chord

Chord.set_license("your username", "your license key")

Chord(matrix.values.tolist(), genres, colors="movies").show()

Chord Diagram

Conclusion

There you have it! In this article, we have gone through the journey of dataset creation to visualization. We completed all these steps using Python, and relied entirely on popular and well-maintained packages, lxml and pandas.

Knowing how to create our own dataset from existing sources is exceptionally useful, and it’s sometimes the only option when an API or data dump has not been made available.

From here, we could grow our IMDb dataset by including the actors that starred in each movie, or visit another data source and practice our new data retrieval, processing, and visualization skills there. Here are some examples that use the same techniques: Co-occurrence of Pokemon Types, Co-occurrence of Animal Crossing Villager Species and Personality, and Retrieving JSON with the requests package.

: Full visibility into your web apps

LogRocket is a frontend application monitoring solution that lets you replay problems as if they happened in your own browser. Instead of guessing why errors happen, or asking users for screenshots and log dumps, LogRocket lets you replay the session to quickly understand what went wrong. It works perfectly with any app, regardless of framework, and has plugins to log additional context from Redux, Vuex, and @ngrx/store.

In addition to logging Redux actions and state, LogRocket records console logs, JavaScript errors, stacktraces, network requests/responses with headers + bodies, browser metadata, and custom logs. It also instruments the DOM to record the HTML and CSS on the page, recreating pixel-perfect videos of even the most complex single-page and mobile apps.

.
Shahin Rostami Dr. Shahin Rostami is a data scientist with software engineering skills that have been honed over two decades. He has experience in both industry and academia, where he's demonstrated innovation and leadership.

Leave a Reply