Until recently, web scraping has been a hands-on process: the developer analyzes a page, finds the pattern in the HTML, and writes code that iterates over pages with a similar layout and parses out the specific data. AI has begun to change this process, and in most cases it reduces the time needed to set up a crawler.
The example below walks through a Python script that uses the scrapegraphai library. I chose this library because it has the built-in ability both to scrape pages and to interact with a variety of AI APIs. However, you could just as well use C# with HTML Agility Pack to pull the full HTML, make a custom call to OpenAI or another API of your choice, and get the same results.
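As a rough illustration of the do-it-yourself route just described, here is a minimal sketch, written in Python rather than C# to stay consistent with the rest of this post. It downloads a page's raw HTML and builds the prompt by hand. The names `fetch_html`, `build_prompt` and `extract_with_llm` are hypothetical, and `llm_call` stands in for whatever function sends text to your chosen API and returns its reply. The 20,000-character cutoff is an arbitrary assumption, not a recommended value.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download the raw HTML of a page (no JavaScript is executed)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_prompt(html, question, max_chars=20000):
    """Combine the question and the (possibly truncated) HTML into one prompt."""
    return f"{question}\n\nHTML:\n{html[:max_chars]}"


def extract_with_llm(html, question, llm_call, max_chars=20000):
    """Pass the prompt to llm_call, a hypothetical stand-in for your API client."""
    return llm_call(build_prompt(html, question, max_chars))
```

Keeping the API call behind a plain function argument means the same scraping code can be pointed at OpenAI, a local model, or a test stub without changes.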
Note that a limitation of this method is that it cannot analyze content generated by JavaScript. For pages that depend on it, you would have to use another scraping library with that functionality built in, such as Selenium.
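For pages that do rely on JavaScript, one possible approach is to let a real browser render the page first and then hand the resulting HTML to the extraction step. The sketch below assumes Selenium and a Chrome installation are available. The function name `fetch_rendered_html` and its `driver` parameter are hypothetical and exist only for illustration.

```python
def fetch_rendered_html(url, driver=None):
    """Return a page's HTML after the browser has loaded it.

    driver is any object with get(url), page_source and quit(); if it is
    omitted, a headless Chrome driver is created (assumes Selenium is installed).
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so the function can be defined without Selenium present.
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Only close a browser that this function opened itself.
        if own_driver:
            driver.quit()
```

Pages that load data asynchronously may need an explicit wait before `page_source` is read, which this sketch leaves out.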
I have put my notes in the code and prefixed them with ##.
from scrapegraphai.graphs import SmartScraperGraph
import json
import os
os.environ["OPENAI_API_KEY"] = ""  ## put your OpenAI API key here
# graph_config = {
#     "llm": {
#         "model": "ollama/llama3:70b",
#         "temperature": 3,
#         "format": "json",  # Ollama needs the format to be specified explicitly
#         "base_url": "http://localhost:11434",
#     },
#     "embeddings": {
#         "model": "ollama/nomic-embed-text",
#         "base_url": "http://localhost:11434",
#     }
# }
## The config sets up the library to interact with the AI API.
## Note that the "llm" option lets you set the model type and temperature.
## The higher the temperature, the more freedom the AI has in its response,
## whereas a low value such as 0.1 keeps the response very concise.
## Embeddings are something often used in RAG implementations. Because the
## HTML sent to the AI can be large, it may exceed the model's context limit,
## so the data is embedded for the AI to review.
graph_config = {
    "llm": {
        "model": "gpt-4o",
        "temperature": 0.5,
        "base_url": "https://api.openai.com/v1"
    },
    "embeddings": {
        ## OpenAI's ada embedding model is named text-embedding-ada-002
        "model": "openai/text-embedding-ada-002",
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}
def scrape_data(url, prompt):
    smart_scraper_graph = SmartScraperGraph(
        prompt=prompt,
        source=url,
        config=graph_config
    )
    scraped_data = smart_scraper_graph.run()
    return scraped_data
iso_url_list = [
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOFasteners',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOBolts',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOKeys',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISONuts',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOPins',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISORivets',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOScrews',
    'https://www.fastenersclearinghouse.com/fch/main.nsf/fISOWashers'
]
## The prompt is the most important part of this, as with most things AI related.
## Specifically requesting that the data be returned as a list allowed me to
## run the same process again below on the set of URLs retrieved here.
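sub_url_list = []
for url in iso_url_list:
    ## extend rather than assign, so the links from every page are kept;
    ## this assumes the model returns a plain list, as the prompt requests
    sub_url_list.extend(scrape_data(
        prompt="return all the links with ISO in it in a list with no dictionary only the list of urls",
        url=url
    ))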
iso_data = []
for url in sub_url_list:
    iso_data.append(scrape_data(
        prompt="Please create a json file of all the ISO TYPEs with their name, diameter sizes and any other important information",
        url=url
    ))
json_data = json.dumps(iso_data)
with open('output_data.json', 'w') as json_file:
    json_file.write(json_data)
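Because the shape of what the model returns can vary with the prompt and the model, it may be worth normalizing the link results before looping over them a second time. The helper below is a minimal sketch under that assumption. `normalize_links` is a hypothetical name, and the rule that only strings starting with `http://`, `https://` or `/` count as links is a simplification of my own.

```python
from urllib.parse import urljoin


def normalize_links(result, base_url):
    """Flatten a list, dict or string result into unique absolute URLs."""
    found = []

    def walk(value):
        # Collect every string, however deeply the model nested it.
        if isinstance(value, str):
            found.append(value.strip())
        elif isinstance(value, dict):
            for item in value.values():
                walk(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                walk(item)

    walk(result)
    seen = set()
    links = []
    for text in found:
        # Assumption: anything else (names, sizes, notes) is not a link.
        if not text.startswith(("http://", "https://", "/")):
            continue
        absolute = urljoin(base_url, text)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Called as `normalize_links(scrape_data(url, prompt), url)`, this would give the second loop a clean list of absolute URLs even when the reply comes back wrapped in a dictionary or contains relative paths.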
Here is a YouTube tutorial which goes over this type of scraping in more detail:
Web Scraping AI AGENT, that absolutely works 😍