Web Scraping with LLMs Using Langchain


TLDR

This article explores the synergy between large language models (LLMs) and LangChain in the realm of web scraping. The integration of LLMs, illustrated here with Azure OpenAI, streamlines the extraction of valuable insights from web data. The provided code, built on BeautifulSoup and html2text, shows a robust process for scraping web content and converting it into structured information. With LangChain's JSON output parsing, the approach becomes not only efficient but also highly adaptable. Together, these technologies automate and structure web data with intelligence, going well beyond the conventional web scraping process. You can find the code here.

Collecting data from the web manually is a tedious task. That's why we have web scraping tools, which can visit a URL and bring back the data that the page contains. This automates fetching the data from a page; however, the result is raw text that must be labelled manually before it can be stored as structured data. Now, with the intelligence of LLMs and LangChain's ease of integrating LLMs into applications, it is possible to store scraped web data as structured insights.

If you don't know what LangChain is, it is a framework for integrating LLMs into applications. In this example, we use BeautifulSoup to scrape a URL, html2text to convert the raw HTML into a useful text format, and finally LangChain to turn that text into useful insights.

Installing dependencies

pip install html2text requests beautifulsoup4 langchain langchain_openai

Web scraping with Beautifulsoup and html2text

import requests
from bs4 import BeautifulSoup
import html2text

def extract_html_from_url(url, excluded_tagNames=None):
    try:
        # Fetch HTML content from the URL using requests
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad responses (4xx and 5xx)

        # Parse HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Exclude boilerplate sections such as the footer and navigation bar
        excluded_tags = excluded_tagNames or ['footer', 'nav']
        for tag_name in excluded_tags:
            for unwanted_tag in soup.find_all(tag_name):
                unwanted_tag.extract()


        # Convert HTML to plain text using html2text
        text_content = html2text.html2text(str(soup))
        return text_content

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data from {url}: {e}")
        return f"Error fetching data from {url}: {e}"

The above code returns the page contents as plain text. Here we exclude the navigation and footer sections of the page, since we do not need them when extracting page contents. You can exclude more tags if you find other unnecessary sections in your use case.
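The exclusion step can be illustrated without a live request. Below is a minimal, standard-library-only sketch that mimics what the BeautifulSoup code above does (the sample HTML and the TextExtractor helper are my own illustration, not from the original code; void tags such as <br> inside excluded sections are ignored for simplicity):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text while skipping everything inside excluded tags."""
    def __init__(self, excluded_tags=('footer', 'nav')):
        super().__init__()
        self.excluded_tags = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1          # nested tag inside an excluded subtree
        elif tag in self.excluded_tags:
            self.skip_depth = 1           # entering an excluded subtree

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html, excluded_tags=('footer', 'nav')):
    parser = TextExtractor(excluded_tags)
    parser.feed(html)
    return ' '.join(parser.chunks)

html = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <h1>Razorpay</h1>
  <p>Payments for India.</p>
  <footer>Copyright 2024</footer>
</body></html>
"""
print(visible_text(html))  # -> Razorpay Payments for India.
```

The nav link and footer text are dropped, so only the main page content reaches the LLM, which keeps the prompt smaller and the extraction cleaner.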

Generating insights with LangChain

import os

from webscraper import extract_html_from_url

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import AzureChatOpenAI

os.environ["AZURE_OPENAI_API_KEY"] = ""  # Get your key from your Azure OpenAI resource in the Azure portal
os.environ["AZURE_OPENAI_ENDPOINT"] = ""


def extract_company_info(url: str):
    print("Hello LLM")

    summary_template = """Given the company information {information} of a company on its Y Combinator page in HTML format, I want you to extract information about the company. You are not allowed to make any assumptions while extracting the information. Every link you provide should be from the information given. There should be no assumptions for links/URLs. You should not return code to do it.
You should extract the following text information from the HTML:
1. Full name of the company.
2. Founder/founders of the company. For each founder, include the LinkedIn URL of the founder if it is present in the information given to you. You should not assume any other information about the founders or their LinkedIn URLs.
3. Founded year.
4. Team size.
5. Job postings by the company, if available in the given information. For each job post, include the title and location.
6. Launch posts of the company, if available in the given information. For each launch post, include the title and the URL or web link of that launch post.
7. Website of the company.
8. Location of the company.
9. Email of the company.
10. Company's LinkedIn URL.
11. Business verticals of the company.
"""
    llm_model = AzureChatOpenAI(
        openai_api_type="azure",
        deployment_name="mockman-interviewdata",
        openai_api_version="2023-07-01-preview",
        temperature=0,
    )

    prompt = PromptTemplate(
        template=summary_template,
        input_variables=["information"],
    )
    llm_chain = LLMChain(llm=llm_model, prompt=prompt)

    company_profile_data = extract_html_from_url(url)
    user_data = llm_chain.invoke(
        input={"information": company_profile_data},
        return_only_outputs=True,
    )
    return user_data["text"]

In the above code I am using an Azure OpenAI model to generate insights. Initializing the model is pretty straightforward: import AzureChatOpenAI from langchain_openai and pass the parameters shown above. If you want to use regular OpenAI instead, you can simply use this code:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key="YOUR_API_KEY")

A prompt template configures your LLM prompt to take dynamic variables; in our case that is the company information. We can pass multiple dynamic variables to make our prompt more closely match what we expect from it.

prompt = PromptTemplate(
    template=summary_template,
    input_variables=["information"],  # list of dynamic data variables
)

Next we have LLM chains. In LangChain, chains refer to sequences of calls, whether to an LLM, a tool, or a data preprocessing step. In our case we need the LLM to generate insights in a step-by-step format. LLMChain takes llm_model and prompt as inputs. With this we have configured an LLMChain to execute a series of calls.

llm_chain = LLMChain(llm=llm_model, prompt=prompt)

Now we can get the company information by calling the function extract_html_from_url(url), where url here is “https://www.ycombinator.com/companies/razorpay”. You can use any URL; however, you will need to change the prompt template to suit the data extracted from that web page.

company_profile_data = extract_html_from_url(url)

Once we have the company information returned from the above function, we can invoke llm_chain to start generating the insights we need.

company_profile_data = extract_html_from_url(url)
user_data = llm_chain.invoke(
    input={"information": company_profile_data},
    return_only_outputs=True,
)
return user_data["text"]

By running this function you will get the response as:

Hello LLM
1. Full Name of the company: Razorpay
2. Founder/Founders of the company:
a. Harshil Mathur (CEO & Co-Founder) - LinkedIn URL: https://www.linkedin.com/pub/harshil-mathur/69/23a/127
b. Shashank Kumar (MD & Co-Founder) - LinkedIn URL: http://www.linkedin.com/in/kumarshashank
3. Founded Year: 2014
4. Team Size: 2700
5. Job postings by the company available in the given information:
a. Engineering Manager/Sr. Engineering Manager - Location: Bangalore, Karnataka
b. Senior / Principal Engineers - Location: Bangalore
c. Senior Infrastructure Engineer - Location: Bengaluru
d. Senior/Principal Engineer/Manager, Frontend Engineering - Location: Bangalore, Karnataka
6. Launch Posts of the company available in the given information:
a. ET Startup Awards 2022: Razorpay wins Startup of the Year award - URL: https://economictimes.indiatimes.com/tech/startups/et-startup-awards-2022-razorpay-wins-startup-of-the-year-award/articleshow/95185680.cms
b. Razorpay Helps FinTech HostBooks Raise $3M - URL: https://www.pymnts.com/news/investment-tracker/2022/razorpay-helps-fintech-hostbooks-raise-3m/
c. YC-backed Indian fintech unicorn enters SEA - URL: https://www.techinasia.com/ycbacked-fintech-unicorn-razorpay-enters-sea
d. Indian fintech giant Razorpay valued at $7.5 billion in $375 million funding - URL: https://techcrunch.com/2021/12/19/indian-fintech-giant-razorpay-valued-at-7-5-billion-in-375-million-funding/
e. India’s Razorpay launches faster checkout feature, tops $60 billion TPV - URL: https://techcrunch.com/2021/12/08/india-razorpay-launches-faster-checkout-feature-tops-60-billion-tpv/
7. Website of the Company: https://razorpay.com
8. Location of the Company: Bengaluru, India
9. Email of the Company: Not available in the given information.
10. Company's LinkedIn URL: https://www.linkedin.com/company/razorpay/
11. Business Verticals of the company: Payments, Finance, India.

This is decent data labelling using an LLM; however, we can do better by imposing data checks so the model returns JSON data.

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

from typing import List, Dict


class CompanyInformation(BaseModel):
    company_name: str = Field(description="Name of the company")
    founders: List[Dict[str, str]] = Field(description="Founders of the company")
    founded_year: int = Field(description="Founded year of the company")
    team_size: int = Field(description="Number of employees of the company")
    jobs: List[Dict[str, str]] = Field(description="Jobs of the company")
    launch_posts: List[Dict[str, str]] = Field(description="Launch posts of the company")
    website: str = Field(description="Website of the company")
    location: str = Field(description="Location of the company")
    email: str = Field(description="Email of the company")
    linkedin_url: str = Field(description="LinkedIn URL of the company")
    verticals: List[str] = Field(description="Verticals of the company")


company_information_parse = JsonOutputParser(
    pydantic_object=CompanyInformation
)

We then pass this parser's format instructions to the prompt template as format_instructions, so the LLM knows the exact JSON schema to produce:

summary_template = """Given the company information {information} of a company on its Y Combinator page in HTML format, I want you to extract information about the company. You are not allowed to make any assumptions while extracting the information. Every link you provide should be from the information given. There should be no assumptions for links/URLs. You should not return code to do it.
You should extract the following text information from the HTML:
1. Full name of the company.
2. Founder/founders of the company. For each founder, include the LinkedIn URL of the founder if it is present in the information given to you. You should not assume any other information about the founders or their LinkedIn URLs.
3. Founded year.
4. Team size.
5. Job postings by the company, if available in the given information. For each job post, include the title and location.
6. Launch posts of the company, if available in the given information. For each launch post, include the title and the URL or web link of that launch post.
7. Website of the company.
8. Location of the company.
9. Email of the company.
10. Company's LinkedIn URL.
11. Business verticals of the company.

{format_instructions}
"""
# {format_instructions} in the template is filled with the parser's schema instructions
prompt = PromptTemplate(
    template=summary_template,
    input_variables=["information"],
    partial_variables={"format_instructions": company_information_parse.get_format_instructions()},
)
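The parsing step itself is handled by JsonOutputParser, which (among other things, such as stripping markdown fences and tolerating partial output) ultimately loads the model's reply as JSON. A minimal standard-library sketch of that final step, with an abbreviated sample reply of my own making:

```python
import json

# A reply shaped the way format_instructions asks for (abbreviated sample data)
model_reply = '{"company_name": "Razorpay", "founded_year": 2014, "team_size": 2700}'

# The core of what the output parser does at the end: load the JSON text
data = json.loads(model_reply)
print(data["company_name"], data["founded_year"])  # -> Razorpay 2014
```

Once loaded, the result is an ordinary dict whose keys match the fields declared on the pydantic model above.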

We can get structured JSON output like this:

Hello LLM
{
  "company_name": "Razorpay",
  "founders": [
    {"Harshil Mathur": "https://www.linkedin.com/pub/harshil-mathur/69/23a/127"},
    {"Shashank Kumar": "http://www.linkedin.com/in/kumarshashank"}
  ],
  "founded_year": 2014,
  "team_size": 2700,
  "jobs": [
    {"title": "Engineering Manager/Sr. Engineering Manager.", "location": "Bangalore, Karnataka"},
    {"title": "Senior / Principal Engineers", "location": "bangalore"},
    {"title": "Senior Infrastructure Engineer", "location": "Bengaluru"},
    {"title": "Senior/Principal Engineer/Manager, Frontend Engineering", "location": "Bangalore, Karnataka"}
  ],
  "launch_posts": [
    {
      "title": "ET Startup Awards 2022: Razorpay wins Startup of the Year award",
      "url": "https://economictimes.indiatimes.com/tech/startups/et-startup-awards-2022-razorpay-wins-startup-of-the-year-award/articleshow/95185680.cms"
    },
    {
      "title": "Razorpay Helps FinTech HostBooks Raise $3M",
      "url": "https://www.pymnts.com/news/investment-tracker/2022/razorpay-helps-fintech-hostbooks-raise-3m/"
    },
    {
      "title": "YC-backed Indian fintech unicorn enters SEA",
      "url": "https://www.techinasia.com/ycbacked-fintech-unicorn-razorpay-enters-sea"
    },
    {
      "title": "Indian fintech giant Razorpay valued at $7.5 billion in $375 million funding",
      "url": "https://techcrunch.com/2021/12/19/indian-fintech-giant-razorpay-valued-at-7-5-billion-in-375-million-funding/"
    },
    {
      "title": "India’s Razorpay launches faster checkout feature, tops $60 billion TPV",
      "url": "https://techcrunch.com/2021/12/08/india-razorpay-launches-faster-checkout-feature-tops-60-billion-tpv/"
    }
  ],
  "website": "https://razorpay.com",
  "location": "Bengaluru, India",
  "email": null,
  "linkedin_url": "https://www.linkedin.com/company/razorpay/",
  "verticals": ["payments", "finance", "india"]
}
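With the output in this structured form, downstream use is direct: no regexes over raw LLM text, just ordinary dict and list access. A small sketch using an abbreviated sample of the parsed output above:

```python
# Abbreviated sample of the parsed JSON output from above
company = {
    "company_name": "Razorpay",
    "founders": [
        {"Harshil Mathur": "https://www.linkedin.com/pub/harshil-mathur/69/23a/127"},
        {"Shashank Kumar": "http://www.linkedin.com/in/kumarshashank"},
    ],
    "jobs": [
        {"title": "Senior Infrastructure Engineer", "location": "Bengaluru"},
    ],
}

# Each founder entry maps a name to a LinkedIn URL, so the keys are the names
founder_names = [name for founder in company["founders"] for name in founder]
job_titles = [job["title"] for job in company["jobs"]]
print(founder_names)  # -> ['Harshil Mathur', 'Shashank Kumar']
print(job_titles)     # -> ['Senior Infrastructure Engineer']
```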

You can view the working code in Colab here.

Overall, by utilizing LangChain, we can incorporate ChatGPT-like intelligence into applications to address our specific use cases. In this instance, it involves generating insights and transforming raw text data into a structured JSON object.
