Unlock the Full Potential of Web Scraping with Bright Data’s Advanced Scraping Browser

Effortlessly address complex scraping issues and maximize your data extraction capabilities with a scalable, efficient solution.

Introduction

Web scraping has become an invaluable tool for businesses and developers across many sectors. Lightweight tools like Cheerio and jQuery are widely used because they parse the DOM and traverse HTML/XML, letting you select exactly the data you need.

Despite their popularity, these tools fall short when it comes to extracting dynamic data or scraping websites that aren't server-side rendered: the data simply isn't present in the initial HTML.
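
To see why, here's a minimal sketch of static scraping with Cheerio (using Node 18+'s built-in fetch; the URL and selector are placeholders, not from a real target). Only content present in the server-rendered HTML is reachable; anything loaded client-side by JavaScript simply won't be there.

const cheerio = require("cheerio");

async function scrapeStatic(url) {
    // Fetch the raw HTML exactly as the server sends it
    const res = await fetch(url);
    const html = await res.text();

    // Parse it with Cheerio and select elements jQuery-style
    const $ = cheerio.load(html);
    return $("h2").map((_, el) => $(el).text()).get();
    // Content injected later by client-side JavaScript never appears here
}

scrapeStatic("https://example.com").then(console.log);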

This is where more powerful solutions like Puppeteer or Playwright come into play. These drive full-fledged browsers that execute JavaScript and XHR/fetch requests, giving you access to dynamically loaded data. Sounds like the perfect solution, right? Why not just use Puppeteer for everything?

Well, not quite. These tools have their own limitations, including:

  • They are significantly slower and more resource-intensive. In fact, if you’re using serverless platforms like AWS Lambda for scraping (a popular choice due to the fresh IPs these services provide), these environments often can’t support a full Puppeteer instance because of its resource demands.
  • Many websites deploy techniques to block automated scrapers, such as IP detection, browser fingerprinting, and CAPTCHA prompts, making it difficult to get the desired data.
  • As a result, you must design not just a robust scraper but also manage proxies, handle automated retries in case of failures, solve CAPTCHAs, and spoof headers and canvas fingerprints, among other things.
  • Building and maintaining such complex, browser-based scrapers quickly becomes a laborious task due to these overlapping technical challenges.

This is where Bright Data’s automated Scraping Browser offers a unique advantage. Like Puppeteer and Playwright, it functions as a programmable browser, but with enhanced features designed to address these pain points.

In this article, we’ll explore exactly what these features are, how the Scraping Browser works, and why it may be the right choice for your web scraping projects.

What is Bright Data’s Scraping Browser?

Bright Data’s Scraping Browser is a comprehensive solution that combines the functionality of a real, automated browser with advanced unlocker infrastructure and built-in proxy and fingerprint management.

How easy is it to use? Let’s break it down:

  • First, you connect to Bright Data’s Scraping Browser over WebSockets via Puppeteer or Playwright, using your credentials.
  • From there, your focus remains solely on building the scraper with the standard Puppeteer or Playwright libraries — no additional setup required.

It really is that straightforward. There’s no need to juggle multiple third-party libraries for managing proxies, fingerprinting, IP rotation, automated retries, logging, or CAPTCHA solving. The Scraping Browser takes care of all these tasks server-side on Bright Data’s infrastructure.

Even better, the browser’s API is fully compatible with existing Puppeteer and Playwright scripts, making both integration and migration incredibly easy.

Learn More:

Scraping Browser API - Automated Browser for Scraping

Getting Started with Bright Data’s Scraping Browser

For this article, we’ll be integrating the Scraping Browser with a Puppeteer script. Let’s walk through the steps to get Bright Data’s Scraping Browser set up and running quickly.

Prerequisites

  • Visit the Scraping Browser page and create a free account by clicking Start free trial/Start free with Google. Once you’ve finished the signup process, Bright Data gives you $5 in free credit for testing purposes.
  • After the account is created, navigate to the Proxy and Scraping Infrastructure section and click the Add button in the top-right corner. Choose Scraping Browser from the list.

  • Give your Scraping Browser a zone name, or keep the auto-generated one, and click Add.

  • Once you confirm, the browser is created and you’re taken to a new page with the Access Parameters tab. This is where you’ll find the required Username, Password, and Host values.

Note down these values; you’ll need them later.
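
As a side note, one optional way to keep those credentials out of your source code is to read them from an environment variable (the variable name here is just an example, not anything Bright Data requires):

// e.g. export BRIGHTDATA_AUTH='brd-customer-<ACCOUNT_ID>-zone-<ZONE_NAME>:<PASSWORD>'
const auth = process.env.BRIGHTDATA_AUTH;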

Coding the Scraper

Now, let’s explore how effortless it is to integrate an existing Puppeteer scraper with the Scraping Browser to take advantage of its proxy and unlocker capabilities. We’ll use a basic script that scrapes Medium’s programming tag to gather article titles, links, and summaries.

const puppeteer = require("puppeteer-core");
const { setTimeout } = require("node:timers/promises");
// Use Node's built-in promise-based timer, as Puppeteer >= 22 removed waitForTimeout()

// Should look like 'brd-customer-<ACCOUNT_ID>-zone-<ZONE_NAME>:<PASSWORD>'
const auth = "your-username:your-password";

async function scrape() {
    let browser;
    try {
        // Here's what makes this all possible: connect to the remote
        // Scraping Browser over WebSockets instead of launching locally
        browser = await puppeteer.connect({
            browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222`,
        });

        const page = await browser.newPage();
        page.setDefaultNavigationTimeout(2 * 60 * 1000);

        await page.goto("https://medium.com/tag/programming/archive");

        // Wait 3 seconds to give the page time to trigger its data load
        await setTimeout(3000);

        // Scroll down 1000px to trigger additional content loading
        await autoScroll(page);

        // Extract the title, link, and summary of each article
        const articlesData = await page.evaluate(() => {
            const articles = document.querySelectorAll("article");
            const articlesDataArray = [];

            articles.forEach((article) => {
                const link = article.querySelector("a h2").parentElement.href;
                const title = article.querySelector("a h2").innerHTML;
                const summary = article.querySelector("a h3")
                    ? article.querySelector("a h3").innerHTML
                    : "";

                articlesDataArray.push({ title, link, summary });
            });

            return articlesDataArray;
        });

        // Print the results
        console.log(JSON.stringify(articlesData, null, 2));
    } catch (e) {
        console.error("run failed", e);
    } finally {
        await browser?.close();
    }
}

if (require.main === module) scrape();

// Scroll down by 200px, 5 times (1000px total), pausing 1s between scrolls
async function autoScroll(page) {
    let currentScroll = 0;
    while (currentScroll < 1000) {
        await page.evaluate("window.scrollBy(0, 200);");
        currentScroll += 200;
        await setTimeout(1000);
    }
}
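
To try it, install the dependency with npm install puppeteer-core, replace the auth placeholder with the Username and Password from your Access Parameters tab, save the file (for example as scrape.js), and run it with node scrape.js.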

What makes this work is Puppeteer’s capability to connect to an existing, remote browser instance — such as Bright Data’s Scraping Browser — using puppeteer.connect() via WebSockets. Once connected, the remote browser instance on Bright Data’s end takes care of all the challenges tied to scalable and reliable web scraping operations.
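Playwright offers the same capability. Here's a minimal sketch of the equivalent connection using Playwright's connectOverCDP, assuming the same endpoint and credential format as the Puppeteer example above:

const { chromium } = require("playwright");

const auth = "your-username:your-password";

async function run() {
    // Connect to the remote Scraping Browser over the Chrome DevTools Protocol
    const browser = await chromium.connectOverCDP(
        `wss://${auth}@zproxy.lum-superproxy.io:9222`
    );
    const page = await browser.newPage();
    await page.goto("https://example.com");
    console.log(await page.title());
    await browser.close();
}

run();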

The rest of the script follows typical Puppeteer practices:

  • Navigate to the target URL (in this case, https://medium.com/tag/programming/archive).
  • Wait for a short duration (3 seconds here) to allow the initial data load to complete (see the note after this list for a more robust alternative).
  • Since it’s an infinite scrolling page, scroll down programmatically by a set amount (we’re scrolling 1000 pixels) to trigger additional content loading.
  • Use a combination of document.querySelectorAll and document.querySelector to extract the desired elements and their attributes, such as href or innerHTML.
  • Format the extracted data as needed (we’re using a JSON array) and handle it accordingly (in this example, we’re logging it to the console, though in a real-world application, you’d likely store it elsewhere).
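
As a more robust, optional alternative to the fixed 3-second delay, you could wait for the target elements themselves to appear, replacing the await setTimeout(3000) line in the script above with:

// Wait until at least one <article> element exists, up to 60 seconds
await page.waitForSelector("article", { timeout: 60000 });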

Conclusion

In summary, Bright Data’s Scraping Browser is a powerful tool that can transform how you handle web scraping. With its advanced features ready to use without any extra infrastructure or code, it stands out from traditional web scraping methods. By automating essential tasks and ensuring compliance with data protection laws, Bright Data’s Scraping Browser simplifies data extraction, allowing you to concentrate solely on your scraping logic.

Here’s a recap of the key benefits:

  • Proxy management and rotation: Bright Data’s Scraping Browser automatically handles proxy management and rotation, letting you focus entirely on your core scraping logic while it manages the proxies for you.
  • CAPTCHA solving: Thanks to Bright Data’s robust unlocker infrastructure, the Scraping Browser reliably solves CAPTCHAs without requiring any additional third-party libraries or integrations.
  • Data protection compliance: The Scraping Browser and its proxy infrastructure fully comply with data protection regulations, including GDPR and the California Consumer Privacy Act (CCPA).
  • Simplified development: Building and maintaining browser-based scrapers can be complex and resource-intensive, especially when managing proxies, obfuscating device fingerprints, and other overlapping tasks. Bright Data’s Scraping Browser streamlines this process, allowing you to focus on your Puppeteer or Playwright scraping code.
  • Enhanced performance: The Scraping Browser offloads browser execution to Bright Data’s infrastructure, smoothing over common Puppeteer/Playwright pitfalls and optimizing proxy management and CAPTCHA solving to keep scraping fast, consistent, and accurate.

Whether you’re a business or a developer seeking a reliable, efficient, and compliant web scraping solution, Bright Data’s Scraping Browser offers the perfect tool with minimal infrastructure needs. Its compatibility with Puppeteer/Playwright and other developer-friendly features make migrating from local scripts to the Scraping Browser an easy and worthwhile choice.

👉 Sign up today and discover the power of Bright Data’s Scraping Browser:

Scraping Browser API - Automated Browser for Scraping