Web scraping is a tricky but necessary part of some applications. In this article, we’re going to explore some principles to keep in mind when writing a web scraper. We’ll also look at what tools Rust has to make writing a web scraper easier.
What we’ll cover: what web scraping is, some general principles to follow when scraping, and a worked example of scraping life expectancy data in Rust using the `reqwest`, `scraper`, and `json` crates.
Web scraping refers to gathering data from a webpage in an automated way. If you can load a page in a web browser, you can load it into a script and parse the parts you need out of it!
However, web scraping can be pretty tricky. HTML isn’t designed as a structured data format, so you usually have to dig around a bit to find the relevant parts.
If the data you want is available in another way — either through some sort of API call, or in a structured format like JSON, XML, or CSV — it will almost certainly be easier to get it that way instead. Web scraping can be a bit of a last resort because it can be cumbersome and brittle.
The details of web scraping highly depend on the page you’re getting the data from. We’ll look at an example below.
Let’s go over some general principles of web scraping that are good to follow.
When writing a web scraper, it’s easy to accidentally make a bunch of web requests quickly. This is considered rude, as it might swamp smaller web servers and make it hard for them to respond to requests from other clients.
Also, it might be considered a denial-of-service (DoS) attack, and it’s possible your IP address could be blocked, either manually or automatically!
The best way to avoid this is to put a small delay in between requests. The example we’ll look at later on in this article has a 500ms delay between requests, which should be plenty of time to not overwhelm the web server.
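As a rough illustration (this is not code from the example we’ll look at later), one simple way to enforce a delay like this in Rust is to remember when the last request happened and sleep for whatever is left of the gap:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Sketch only: wait until at least `min_gap` has passed since the previous
// request before letting the next one go out.
fn throttle(last_request: &mut Option<Instant>, min_gap: Duration) {
    if let Some(last) = *last_request {
        let elapsed = last.elapsed();
        if elapsed < min_gap {
            sleep(min_gap - elapsed);
        }
    }
    *last_request = Some(Instant::now());
}
```

The example we’ll walk through later wraps its requests in a helper that does something along these lines.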
As we’ll see in the example, a lot of the HTML out there is not designed to be read by humans, so it can be a bit tricky to figure out how to locate the data to extract.
One option is to do something like finding the seventh `p` element in the document. But this is very fragile; if the HTML document changes even a tiny bit, the seventh `p` element could easily be something different.
It’s better to try to find something more robust that seems like it won’t change.
In the example we’ll look at below, to find the main data table, we find the `table` element that has the most rows, which should be stable even if the page changes significantly.
Another way to guard against unexpected page changes is to validate as much as you can. Exactly what you validate will be pretty specific to the page you are scraping and the application you are using to do so.
In the example below, some of the things we validate include making sure the age in each row is the next expected value, that the extracted values stay in the expected range and never increase from one age to the next, and that we find exactly the columns we expect.
It’s also helpful to include reasonable error messages to make it easier to track down what invariant has been violated when a problem occurs.
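As a tiny standalone illustration of the idea (again, not code from the example repo, and the names here are made up), a check like this fails loudly with a message that points straight at the broken invariant:

```rust
// Illustrative only: fail with a descriptive message if a parsed value
// violates an invariant we rely on, rather than silently saving bad data.
fn validate_fraction(value: f32, row: usize) {
    assert!(
        (0.0..=1.0).contains(&value),
        "value {} in row {} is out of the expected 0-1 range",
        value,
        row
    );
}
```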
Now, let’s look at an example of web scraping with Rust!
In this example, we are going to gather life expectancy data from the Social Security Administration (SSA). This data is available in “life tables” found on various pages of the SSA website.
The page we are using lists, for people born in 1900, their chances of surviving to various ages. The SSA provides a much more comprehensive explanation of these life tables, but we don’t need to read through the entire study for this article.
The table is split into two parts, male and female. Each row of the table represents a different age (that’s the “x” column). The various other columns show different statistics about survival rates at that age.
For our purposes, we care about the “lx” column, which starts with 100,000 babies born (at age 0) and shows how many are still alive at a given age. This is the data we want to capture and save into a JSON file.
The SSA provides this data for babies born every 10 years from 1900-2100 (I assume the data in the year 2100 is just a projection, unless they have time machines over there!). We’d like to capture all of it.
One thing to notice: in 1900, 14 percent of babies didn’t survive to age one! In 2020, that number was more like 0.5 percent. Hooray for modern medicine!
The HTML table itself is kind of weird; because it’s split up into male and female, there are essentially two tables in one `table` element, a bunch of header rows, and blank rows inserted every five years to make it easier for humans to read. We’ll have to deal with all this while building our Rust web scraper.
The example code is in this GitHub repo. Feel free to follow along as we look at different parts of the scraper!
First, we need to fetch the webpage. We will use the `reqwest` crate for this step. This crate has powerful ways to fetch pages asynchronously in case you’re doing a bunch of work at once, but for our purposes, using the blocking API is simpler.
Note that to use the blocking API, you need to add the “blocking” feature to the `reqwest` dependency in your `Cargo.toml` file; see an example at line nine of that file in the GitHub repo.
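For reference, that dependency line looks something like the following; the version number here is only illustrative, so check the repo’s `Cargo.toml` for the exact one:

```toml
# Enable reqwest's blocking API in addition to the default async one
reqwest = { version = "0.11", features = ["blocking"] }
```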
Fetching the page is done in the `do_throttled_request()` function in `scraper_utils.rs`. Here’s a simplified version of that code:
```rust
// Do a request for the given URL, with a minimum time between requests
// to avoid overloading the server.
pub fn do_throttled_request(url: &str) -> Result<String, Error> {
    // See the real code for the throttling - it's omitted here for clarity
    let response = reqwest::blocking::get(url)?;
    response.text()
}
```
At its core, this function is pretty simple: do the request and return the body as a `String`. We’re using the `?` operator to do an early return on any error we encounter, such as if our network connection is down.
Interestingly, the `text()` method can also fail, and we just return that as well. Remember that since the last line doesn’t have a semicolon at the end, it’s the same as doing the following, but a bit more idiomatic for Rust:
```rust
return response.text();
```
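To make the early return concrete, here’s roughly what the same function looks like with the `?` operator written out by hand. This is purely illustrative and assumes the `Error` in the signature above is `reqwest::Error`:

```rust
// Illustrative only: the same request with the early return spelled out
// instead of using the `?` operator.
pub fn do_throttled_request_desugared(url: &str) -> Result<String, reqwest::Error> {
    let response = match reqwest::blocking::get(url) {
        Ok(response) => response,
        Err(err) => return Err(err),
    };
    // The Result from text() is returned as-is, errors and all
    response.text()
}
```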
Now to the hard part! We will be using the appropriately named `scraper` crate, which is built on HTML parsing code from the Servo project (which shares code with Firefox). In other words, it’s an industrial-strength parser!
The parsing is done in the `parse_page()` function in the `main.rs` file. Let’s break it down into steps.
First, we parse the document. Notice that the `parse_document()` call below doesn’t return an error and thus can’t fail, which makes sense since this is code coming from a real web browser. No matter how badly formed the HTML is, the browser has to render something!
```rust
let document = Html::parse_document(&body);
// Find the table with the most rows
let main_table = document.select(&TABLE).max_by_key(|table| {
    table.select(&TR).count()
}).expect("No tables found in document?");
```
Next, we want to find all the tables in the document. The `select()` call allows us to pass in a CSS selector and returns all the nodes that match that selector.

CSS selectors are a very powerful way to specify which nodes you want. For our purposes, we just want to select all `table` nodes, which is easy to do with a simple type selector:
```rust
static ref TABLE: Selector = make_selector("table");
```
Once we have all of the table nodes, we want to find the one with the most rows. We will use the `max_by_key()` method, and for the key we get the number of rows in the table.

Nodes also have a `select()` method, so we can use another simple selector to get all the descendants that are rows and count them:
```rust
static ref TR: Selector = make_selector("tr");
```
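These `static ref` definitions come from the `lazy_static!` macro, and `make_selector()` is defined near the top of `main.rs` in the repo. It’s just a thin convenience wrapper around the `scraper` crate’s `Selector::parse()`, roughly like this:

```rust
use scraper::Selector;

// Parse a CSS selector string, panicking if it's invalid. That's acceptable
// here because all of the selectors we use are hard-coded.
fn make_selector(selector: &str) -> Selector {
    Selector::parse(selector).unwrap()
}
```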
Now it’s time to find out which columns have the “100,000” text. Here’s that code, with some parts omitted for clarity:
```rust
let mut column_indices: Option<ColumnIndices> = None;
for row in main_table.select(&TR) {
    // Need to collect this into a Vec<> because we're going to be iterating over it
    // multiple times.
    let entries = row.select(&TD).collect::<Vec<_>>();
    if column_indices.is_none() {
        let mut row_number_index: Option<usize> = None;
        let mut male_index: Option<usize> = None;
        let mut female_index: Option<usize> = None;
        // look for values of "0" (for the row number) and "100000"
        for (column_index, cell) in entries.iter().enumerate() {
            let text: String = get_numeric_text(cell);
            if text == "0" {
                // Only want the first column that has a value of "0"
                row_number_index = row_number_index.or(Some(column_index));
            } else if text == "100000" {
                // male columns are first
                if male_index.is_none() {
                    male_index = Some(column_index);
                } else if female_index.is_none() {
                    female_index = Some(column_index);
                } else {
                    panic!("Found too many columns with text \"100000\"!");
                }
            }
        }
        assert_eq!(male_index.is_some(), female_index.is_some(),
            "Found male column but not female?");
        if let Some(male_index) = male_index {
            assert!(row_number_index.is_some(), "Found male column but not row number?");
            column_indices = Some(ColumnIndices {
                row_number: row_number_index.unwrap(),
                male: male_index,
                female: female_index.unwrap()
            });
        }
    }
```
For each row, if we haven’t found the column indices we need, we’re looking for a value of `0` for the age and `100000` for the male and female columns.
Note that the `get_numeric_text()` function takes care of removing any commas from the text. Also notice the number of asserts and panics here to guard against the format of the page changing too much; we’d much rather have the script error out than get incorrect data!
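For reference, here’s a rough sketch of what a helper like `get_numeric_text()` can look like. This is just an illustration; the actual helper in the repo may differ in its details:

```rust
use scraper::ElementRef;

// Collect all the text inside a cell and strip commas and surrounding
// whitespace, so a value like "100,000" comes back as "100000".
fn get_numeric_text(cell: &ElementRef) -> String {
    cell.text().collect::<String>().replace(',', "").trim().to_string()
}
```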
Finally, here’s the code that gathers all the data:
```rust
    if let Some(column_indices) = column_indices {
        if entries.len() < column_indices.max_index() {
            // Too few columns, this isn't a real row
            continue
        }
        let row_number_text = get_numeric_text(&entries[column_indices.row_number]);
        if row_number_text.parse::<u32>().map(|x| x == next_row_number) == Ok(true) {
            next_row_number += 1;
            let male_value = get_numeric_text(&entries[column_indices.male]).parse::<u32>();
            let male_value = male_value.expect("Couldn't parse value in male cell");
            // The page normalizes all values by assuming 100,000 babies were born in the
            // given year, so scale this down to a range of 0-1.
            let male_value = male_value as f32 / 100000_f32;
            assert!(male_value <= 1.0, "male value is out of range");
            if let Some(last_value) = male_still_alive_values.last() {
                assert!(*last_value >= male_value, "male values are not decreasing");
            }
            male_still_alive_values.push(male_value);
            // Similar code for female values omitted
        }
    }
```
This code just makes sure that the row number (i.e., the age) is the next expected value, and then gets the values from the columns, parses them, and scales them down. Again, we do some assertions to make sure the values look reasonable.
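For reference, `ColumnIndices` is just a small bookkeeping struct. A definition consistent with how it’s used above might look something like the following; the real one lives in `main.rs` and may differ slightly:

```rust
// Which columns of the table hold the age (row number), male, and female values.
struct ColumnIndices {
    row_number: usize,
    male: usize,
    female: usize,
}

impl ColumnIndices {
    // The largest column index we need; used above to skip rows that are too short.
    fn max_index(&self) -> usize {
        self.row_number.max(self.male).max(self.female)
    }
}
```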
For this application, we wanted the data written out to a file in JSON format. We will use the `json` crate for this step. Now that we have all the data, this part is pretty straightforward:
```rust
fn write_data(data: HashMap<u32, SurvivorsAtAgeTable>) -> std::io::Result<()> {
    let mut json_data = json::object! {};
    let mut keys = data.keys().collect::<Vec<_>>();
    keys.sort();
    for &key in keys {
        let value = data.get(&key).unwrap();
        let json_value = json::object! {
            "female": value.female.clone(),
            "male": value.male.clone()
        };
        json_data[key.to_string()] = json_value;
    }
    let mut file = File::create("fileTables.json")?;
    write!(&mut file, "{}", json::stringify_pretty(json_data, 4))?;
    Ok(())
}
```
Sorting the keys isn’t strictly necessary, but it does make the output easier to read. We use the handy `json::object!` macro to easily create the JSON data and write it out to a file with `write!`. And we’re done!
Hopefully this article gives you a good starting point for doing web scraping in Rust.
With these tools, a lot of the work can be reduced to crafting CSS selectors to get the nodes you’re interested in, and figuring out what invariants you can use to assert that you’re getting the right ones in case the page changes!