Editor’s note: This article was last updated on 17 October 2024.
In this article, we’ll explore a few of the best Node.js web scraping libraries and techniques. You’ll also learn how they differ and when each is the right fit for your project’s needs.
Whether you want to build your own search engine, monitor a website to alert you when tickets for your favorite concert are available, or you need essential information for your company, there are many Node.js web scraper libraries that have you covered.
If you’re familiar with Axios, it might not sound like the most appealing option for scraping the web. Be that as it may, it is a simple solution that can help you get the job done, and it offers the added benefit of being a library you likely already know quite well.
Axios is a promise-based HTTP client for Node.js that became super popular among JavaScript projects for its simplicity and adaptability. Although Axios is typically used in the context of calling REST APIs, it can fetch websites’ HTML as well.
Because Axios will only get the response from the server, it will be up to you to parse and work with the result. Therefore, I recommend using this library when working with JSON responses or for simple scraping needs.
You can install Axios using your favorite package manager as follows:
npm install axios
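For instance, if the site you’re interested in exposes a JSON API, Axios parses the response body for you and there’s nothing left to scrape by hand. Here’s a minimal sketch of that case; the endpoint and the title field are hypothetical:

const axios = require('axios');

// Hypothetical JSON endpoint; replace it with an API the target site actually exposes
axios
  .get('https://example.com/api/posts')
  .then((response) => {
    // response.data is already parsed JSON, so no HTML parsing is needed
    response.data.forEach((post) => console.log(`- ${post.title}`));
  })
  .catch((error) => console.error('Request failed:', error));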
Below is an example of using Axios to list all the article headlines from the LogRocket Blog’s homepage:
const axios = require('axios');

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    const reTitles = /(?<=\<h2 class="card-title"><a\shref=.*?\>).*?(?=\<\/a\>)/g;
    [...response.data.matchAll(reTitles)].forEach(title => console.log(`- ${title}`));
  });
In the example above, you can see how Axios is great for HTTP requests. However, parsing HTML with complex structures requires you to craft elaborate rules or regular expressions, even for simple tasks.
So, if regular expressions aren’t your thing and you prefer a more DOM-based approach, you could transform the HTML into a DOM-like object with libraries like JSDom or Cheerio. Let’s explore the same example from above using JSDom instead:
const axios = require('axios');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    const dom = new JSDOM(response.data);
    [...dom.window.document.querySelectorAll('.card-title a')].forEach(el =>
      console.log(`- ${el.textContent}`)
    );
  });
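If you prefer Cheerio’s jQuery-like API over JSDom, the same example looks roughly like this. This is a minimal sketch that assumes the .card-title a selector still matches the blog’s markup:

const axios = require('axios');
const cheerio = require('cheerio');

axios
  .get("https://logrocket.com/blog")
  .then(function (response) {
    // Load the raw HTML into Cheerio and query it with CSS selectors
    const $ = cheerio.load(response.data);
    $('.card-title a').each((i, el) => console.log(`- ${$(el).text()}`));
  });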
This kind of solution would soon encounter its limitations. For example, you’ll only get the raw response from the server — what if elements on the page you want to access are loaded asynchronously?
What about single-page applications (SPAs), where the HTML simply loads JavaScript libraries that do all the rendering work on the client? Or what if you encounter one of the limitations imposed by such libraries? After all, they aren’t a full HTML/DOM implementation, only a subset of one.
In scenarios like these, or for complex websites, the best choice may be a completely different approach using other libraries.
Puppeteer is a high-level Node.js API for controlling Chrome or Chromium with code. So, what does that mean for us in terms of web scraping?
With Puppeteer, you access the power of a full-fledged browser like Chromium, running in the background in headless mode, to navigate websites and fully render styles, scripts, and asynchronous information.
To use Puppeteer in your project, you can install it like any other JavaScript package:
npm install puppeteer
Now, let’s see an example of Puppeteer in action:
const puppeteer = require("puppeteer"); async function parseLogRocketBlogHome() { // Launch the browser const browser = await puppeteer.launch(); // Open a new tab const page = await browser.newPage(); // Visit the page and wait until network connections are completed await page.goto('https://logrocket.com/blog', { waitUntil: 'networkidle2' }); // Interact with the DOM to retrieve the titles const titles = await page.evaluate(() => { // Select all elements with crayons-tag class return [...document.querySelectorAll('.card-title a')].map(el => el.textContent); }); // Don't forget to close the browser instance to clean up the memory await browser.close(); // Print the results titles.forEach(title => console.log(`- ${title}`)) } parseLogRocketBlogHome();
While Puppeteer is a fantastic solution, it is more complex to work with, especially for simple projects. It is also much more demanding in terms of resources — you are, after all, running a full Chromium browser, and we know how memory-hungry those can be.
X-Ray is a Node.js library created for scraping the web, so it’s no surprise that its API is heavily focused on that task. As such, it abstracts away most of the complexity we encountered with Puppeteer and Axios.
To install X-Ray, you can run the following command:
npm install x-ray
Now, let’s build our example using X-Ray:
const Xray = require('x-ray');
const x = Xray();

x('https://logrocket.com/blog', {
  titles: ['.card-title a']
})((err, result) => {
  result.titles.forEach(title => console.log(`- ${title}`));
});
X-Ray is a great option if your use case involves scraping a large number of webpages. It supports concurrency and pagination out of the box, so you don’t need to worry about those details.
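Pagination, for example, only takes a couple of extra calls. Here’s a minimal sketch; the .card-title scope and the .pagination .next link are assumptions about the target site’s markup:

const Xray = require('x-ray');
const x = Xray();

// Collect one object per post and follow the "next page" link for a few pages
x('https://logrocket.com/blog', '.card-title', [{
  title: 'a',
  link: 'a@href'
}])
  // The pagination selector is an assumption about the site's markup
  .paginate('.pagination .next@href')
  // Stop after three pages so we don't hammer the server
  .limit(3)
  .then(posts => posts.forEach(post => console.log(`- ${post.title}`)))
  .catch(err => console.error(err));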
Osmosis is very similar to X-Ray, designed explicitly for scraping webpages and extracting data from HTML, XML, and JSON documents.
To install Osmosis, run the following code:
npm install osmosis
Below is the sample code:
const osmosis = require('osmosis');

osmosis
  .get('https://logrocket.com/blog')
  .set({
    titles: ['.card-title a']
  })
  .data(function (result) {
    result.titles.forEach(title => console.log(`- ${title}`));
  });
As you can see, Osmosis is similar to X-Ray in terms of syntax and style used to retrieve and work with data.
Superagent is a lightweight, progressive HTTP request library for Node.js and the browser. Due to its simplicity and ease of use, it is commonly used for web scraping.
Just like Axios, Superagent is also limited to only getting the response from the server; it will be up to you to parse and work with the result. Depending on your scraping needs, you can retrieve HTML pages, JSON data, or other types of content using Superagent.
To use Superagent in your project, you can install it like any other JavaScript package:
npm install superagent
When scraping HTML pages, you must parse the HTML content to extract the desired data. For this, you can use libraries like Cheerio or JSDOM.
To use Cheerio in your project, you can install it like any other JavaScript package:
npm install cheerio
Let’s review an example of web scraping with Superagent and Cheerio in action:
const superagent = require("superagent"); const cheerio = require("cheerio"); const url = "https://blog.logrocket.com"; superagent.get(url).end((err, res) => { if (err) { console.error("Error fetching the website:", err); return; } const $ = cheerio.load(res.text); // Replace the following selectors with the actual HTML elements you want to scrape const titles = $(".card-title a") .map((i, el) => $(el).text()) .get(); const descriptions = $("p.description") .map((i, el) => $(el).text()) .get(); // Display the scraped data console.log("Titles:", titles); console.log("Descriptions:", descriptions); });
The script will make an HTTP GET request to the specified URL using Superagent, fetch the HTML content of the page, and then use Cheerio to extract the data from the specified selectors.
While Superagent is a great solution, using it for web scraping may result in incomplete or inaccurate data extraction, and therefore inconsistent data, depending on the complexity of the website’s structure and the parsing methods used.
Playwright is a powerful tool for web scraping and browser automation, especially when dealing with modern web applications with dynamic content and complex interactions. Its multibrowser support, automation capabilities, and performance make it an excellent choice for developers looking to perform advanced web scraping tasks in Node.js applications.
Playwright is a relatively new open source library developed by Microsoft. It provides complete control over the browser’s state, cookies, network requests, and browser events, making it ideal for complex scraping scenarios.
To use Playwright in your project, you can install it like so:
npm install playwright
Let’s look at an example of web scraping with Playwright:
const { chromium } = require("playwright"); (async () => { const browser = await chromium.launch(); const context = await browser.newContext(); const page = await context.newPage(); const url = "https://blog.logrocket.com"; // Replace with the URL of the website you want to scrape try { await page.goto(url); // Replace the following selectors with the actual HTML elements you want to scrape const titleElement = await page.$("h1"); const descriptionElement = await page.$("p.description"); const title = await titleElement.textContent(); const description = await descriptionElement.textContent(); const inputElement = await page.$('input[type="text"]'); const value = await inputElement.inputValue(); console.log(value); console.log("Title:", title); console.log("Description:", description); } catch (error) { console.error("Error while scraping:", error); } finally { await browser.close(); } })();
The script will launch a Chromium browser, navigate to the specified URL, and use Playwright’s methods to interact with the website and extract data from the specified selectors.
Playwright is a robust scraping library, but when compared to lightweight HTTP-based scraping libraries, it incurs more resource overhead because it uses headless browsers to perform scraping tasks. This can have an impact on performance and memory usage, especially if you’re scraping multiple pages or performing a large number of scraping tasks.
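One way to keep that overhead in check is to reuse a single browser instance across pages and block heavy resources such as images and fonts. Here’s a minimal sketch of that approach; the second URL in the list is hypothetical:

const { chromium } = require("playwright");

(async () => {
  // Launch one browser and reuse it for every page we scrape
  const browser = await chromium.launch();
  const context = await browser.newContext();

  // Skip images, fonts, and stylesheets to save bandwidth and memory
  await context.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2,css}", (route) => route.abort());

  const urls = [
    "https://blog.logrocket.com",
    "https://blog.logrocket.com/some-post", // Hypothetical; replace with real targets
  ];

  for (const url of urls) {
    const page = await context.newPage();
    await page.goto(url);
    console.log(await page.title());
    await page.close();
  }

  await browser.close();
})();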
Although web scraping is legal for publicly available information, you should be aware that many sites put limitations in place as part of their terms of service. Some may even include rate limits to prevent you from slowing down their services — but why is that?
When you scrape information from a site, you use its resources.
Let’s suppose you’re aggressive in terms of accessing too many pages too quickly. In that case, you may degrade the site’s general performance for its users. So, when scraping the web, you must get consent or permission from the owner and be mindful of the strains you are putting on their sites.
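In practice, being polite can be as simple as spacing your requests out. Below is a small sketch of that idea using Axios; the politeFetchAll helper and the URLs are illustrative, not part of any library:

const axios = require('axios');

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch a list of pages one at a time, pausing between requests
// so we don't put unnecessary strain on the site
async function politeFetchAll(urls, delayMs = 1000) {
  const pages = [];
  for (const url of urls) {
    const response = await axios.get(url);
    pages.push(response.data);
    await sleep(delayMs);
  }
  return pages;
}

// Illustrative usage
politeFetchAll(['https://example.com/page/1', 'https://example.com/page/2'])
  .then((pages) => console.log(`Fetched ${pages.length} pages`));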
Lastly, web scraping requires a considerable effort for development and, in many cases, maintenance. Changes in the structure of the target site may break your scraping code and require you to update your script to adjust to the new formats.
For this reason, I prefer consuming an API when possible and scraping the web only as a last resort.
Ultimately, the best Node.js scraper is the one that best fits your project needs. In this article, we covered some factors to help influence your decision.
For most tasks, any of these options will suffice, so choose the one you feel most comfortable with. In my professional life, I’ve had the opportunity to build multiple projects that needed to gather information from publicly available sources and internal systems.
Because the requirements were diverse, each of these projects used a different approach and library, ranging from Axios to X-Ray, with Puppeteer reserved for the most complex situations.
Finally, you should always respect the website’s terms and conditions regardless of what scraper you choose. Scraping data can be a powerful tool, but with that comes great responsibility. Thanks for reading!
One Reply to "The best Node.js web scrapers for your use case"
Great article! I’m scaling puppeteer on remote browsers (so I can run thousands of headed browsers concurrently) and use browserless platform to do that… I was wondering if these technologies you mention in the article are compatible with such a technology? It requires to use puppeteers .connect() method instead of .launch()
Again, great article!