Building Smart Web Scrapers with Local LLMs: Using AI to Create Resilient Data Extraction Tools
A guide to leveraging locally-run Large Language Models (LLMs) to build web scrapers that can adapt to changing website structures and extract data more reliably than traditional CSS selector-based approaches.
When I first started building web scrapers, I kept running into the same problem: brittle CSS selectors. You spend hours crafting the perfect selectors, and then the website updates its class names or restructures its HTML. Suddenly, your scraper breaks, and you’re back to square one. That’s why I started experimenting with running LLMs locally through LM Studio to make my scrapers more resilient.
Cleaning HTML: Preparing Content for Semantic Extraction
When scraping web pages, the raw HTML is often a messy cocktail of scripts, styles, and unnecessary markup. Before sending content to our local LLM, we need to strip away the noise and focus on the meaningful text. Here’s how I approach HTML cleaning:
def clean_html_for_llm(doc)
  # Remove script, style, and stylesheet tags that bloat our context
  doc.css('script, style, link[rel="stylesheet"]').remove

  # Remove common non-content elements
  doc.css('header, footer, nav, aside, .sidebar, .advertisement').remove

  # Return just the cleaned body markup
  doc.at('body').to_html
end
def prepare_content_for_extraction(url)
  # Use Selenium to render JavaScript-heavy pages
  @driver.get(url)

  # Parse the rendered page
  doc = Nokogiri::HTML(@driver.page_source)

  # Clean the HTML
  cleaned_content = clean_html_for_llm(doc)

  # Optionally, truncate very long content to prevent context overflow
  cleaned_content[0..15000]
end
Why This Matters
Traditional web scraping often fails because websites are complex, dynamic beasts. By carefully cleaning our HTML, we:
- Reduce Noise: Remove scripts, styles, and navigation elements that distract our LLM
- Improve Context Quality: Focus on the core content
- Prevent Context Overflow: Ensure we don’t exceed token limits
- Enhance Extraction Accuracy: Give the LLM a cleaner signal to work with
Practical Example
Let’s break down what’s happening:
- `script` and `style` tags are pure noise for content extraction
- Navigation, headers, and footers rarely contain job details
- We use Nokogiri to parse and manipulate the HTML surgically
- Selenium ensures we capture dynamically loaded content
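The snippets above assume an already-initialized `@driver`. A minimal setup with the selenium-webdriver gem and headless Chrome looks something like this (the exact Chrome options are my own choice, so adjust them for your environment):
require 'selenium-webdriver'
require 'nokogiri'

# Headless Chrome: render JavaScript without opening a browser window
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
options.add_argument('--disable-gpu')

@driver = Selenium::WebDriver.for(:chrome, options: options)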
Pro Tips
- Always set a reasonable content length limit
- Use CSS selectors to remove complex, nested elements
- Consider website-specific cleaning rules for tricky job boards
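For that last tip, here's a rough sketch of how per-site rules could sit on top of `clean_html_for_llm` (the host names and selectors below are placeholders, not real job boards):
require 'uri'

# Hypothetical site-specific cleanup rules, keyed by host name
SITE_SPECIFIC_SELECTORS = {
  'jobs.example.com'    => '.related-jobs, .newsletter-signup',
  'careers.example.org' => '#cookie-banner, .share-widget'
}.freeze

def apply_site_specific_cleaning(doc, url)
  selectors = SITE_SPECIFIC_SELECTORS[URI(url).host]
  doc.css(selectors).remove if selectors
  doc
end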
By implementing smart HTML cleaning, we transform brittle web scraping into a robust, AI-powered data extraction system.
The Traditional Problem
Traditional web scrapers typically look something like this:
def scrape_job(doc)
  {
    title: doc.css('.job-title').text,
    company: doc.css('.company-name').text,
    location: doc.css('.location').text
  }
end
This works great… until it doesn’t. One website update and everything breaks.
A Better Way: Using Local LLMs
Instead of relying on specific HTML structures, we can use LLMs to understand the content semantically. Here’s how I do it:
class JobDetailsExtractor
  LM_STUDIO_URL = 'http://127.0.0.1:1234/v1/chat/completions'

  def process_page(url)
    # Load the page so dynamically rendered content is present
    @driver.get(url)

    # First, clean the HTML
    doc = Nokogiri::HTML(@driver.page_source)
    doc.css('script, style').remove

    # Send to local LLM
    response = make_lm_studio_request(doc.at('body').to_html)

    # Parse structured response
    parse_response(response)
  end
end
Structured Data Extraction with JSON Schema
The real magic happens in how we format our LLM request. Here’s the key part:
# Uses net/http, uri, and json from the Ruby standard library
def make_lm_studio_request(content, missing_fields = [])
  uri = URI(LM_STUDIO_URL)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')

  system_prompt = "You are a job posting analyzer. Extract detailed information..."
  if missing_fields.any?
    system_prompt += " Additionally, please specifically look for these missing fields: #{missing_fields.join(', ')}."
  end

  request.body = {
    model: "llama-3.2-3b-instruct",
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: content }
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "job_details",
        strict: "true",
        schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            company: { type: "string" },
            location: { type: "string" },
            salary_range: { type: "string" },
            employment_type: { type: "string" },
            description: { type: "string" },
            requirements: { type: "string" },
            benefits: { type: "string" },
            government_job: { type: "boolean" },
            urgent: { type: "boolean" }
          },
          required: ["title", "description", "requirements"]
        }
      }
    }
  }.to_json

  # Local models can be slow, so give the request a generous read timeout
  Net::HTTP.start(uri.hostname, uri.port, read_timeout: 120) { |http| http.request(request) }
end
By using the `json_schema` parameter, we get:
- Consistent Output: Every response follows our defined structure
- Type Safety: Fields come back in the correct format
- Required Fields: We can specify which fields must be present
- No Post-Processing: The data is ready to use
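I haven't shown `parse_response` above, so here's a minimal sketch, assuming the OpenAI-compatible response shape LM Studio returns (the model's JSON lives in `choices[0].message.content`):
# Pull the structured JSON out of the chat completion response
def parse_response(response)
  body = JSON.parse(response.body)
  raw  = body.dig('choices', 0, 'message', 'content')
  JSON.parse(raw)
rescue JSON::ParserError => e
  warn "Failed to parse LLM response: #{e.message}"
  {}
end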
Making it Production Ready
Rate Limiting
To be a good web citizen, we need rate limiting:
MIN_REQUEST_INTERVAL = 3 # Minimum seconds between requests
MAX_REQUEST_INTERVAL = 5 # Maximum seconds between requests
def process_links
  jobs.each do |job|
    if @last_web_request_time
      elapsed = Time.now - @last_web_request_time
      # Wait a random interval between the min and max so requests aren't perfectly periodic
      wait = rand(MIN_REQUEST_INTERVAL..MAX_REQUEST_INTERVAL)
      sleep(wait - elapsed) if elapsed < wait
    end
    process_job(job)
    @last_web_request_time = Time.now
  end
end
Error Handling and Text Cleaning
We need robust error handling for both web requests and LLM processing:
def clean_text(text)
  return "" if text.nil?

  text
    .to_s
    .encode('UTF-8', invalid: :replace, undef: :replace, replace: '') # drop invalid bytes
    .gsub(/[\u0000-\u001F\u007F\u2028\u2029]/, '')                    # strip control characters
    .gsub(/[,\r\n\t]+/, ' ')                                          # flatten commas, newlines, tabs for CSV
    .gsub(/\s+/, ' ')                                                 # collapse runs of whitespace
    .strip
end
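`clean_text` covers the text side. For the request side, a simple retry wrapper is usually enough; here's a sketch (the method name, retry count, and backoff are my own choices):
MAX_RETRIES = 3

# Retry transient web and LLM failures with a short backoff
def with_retries(label)
  attempts = 0
  begin
    yield
  rescue Net::ReadTimeout, Errno::ECONNREFUSED, Selenium::WebDriver::Error::WebDriverError => e
    attempts += 1
    raise if attempts >= MAX_RETRIES
    warn "#{label} failed (#{e.class}), retrying in #{attempts * 5}s..."
    sleep(attempts * 5)
    retry
  end
end
Each call then becomes something like `with_retries('job page') { process_page(url) }`.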
Handling Missing Fields
Sometimes job posts don’t include all the information we want:
def check_missing_fields(job)
  missing = []
  ['company', 'location'].each do |field|
    missing << field if job[field].nil? || job[field].to_s.strip.empty?
  end
  # The names returned here get appended to the system prompt on a retry
  # (see make_lm_studio_request above): "Additionally, please specifically
  # look for these missing fields: ..."
  missing
end
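Tying it together, a second pass might look roughly like this (`cleaned_content` and the merge strategy are assumptions on my part):
# Hypothetical retry: ask the LLM again, calling out the fields it missed
missing = check_missing_fields(job)
if missing.any?
  retry_response = make_lm_studio_request(cleaned_content, missing)
  # Only fill in fields that are still empty
  job.merge!(parse_response(retry_response)) { |_field, old, new| old.to_s.strip.empty? ? new : old }
end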
Data Storage
I use CSV files for storage, but you could easily adapt this for a database:
def setup_csv
  FileUtils.mkdir_p('data')

  CSV.open(DETAILS_CSV_PATH, 'w') do |csv|
    csv << [
      'id',
      'title',
      'company',
      'location',
      'salary_range',
      'employment_type',
      'description',
      'requirements',
      'benefits',
      'processed_date'
    ]
  end
end
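Appending each processed job is then one row per page. Here's a sketch that assumes the job hash uses the same string keys as the header row (the UUID for the id column is my own choice):
require 'csv'
require 'securerandom'

# Append one processed job to the details CSV
def save_job(job)
  CSV.open(DETAILS_CSV_PATH, 'a') do |csv|
    csv << [
      SecureRandom.uuid,
      clean_text(job['title']),
      clean_text(job['company']),
      clean_text(job['location']),
      clean_text(job['salary_range']),
      clean_text(job['employment_type']),
      clean_text(job['description']),
      clean_text(job['requirements']),
      clean_text(job['benefits']),
      Time.now.strftime('%Y-%m-%d')
    ]
  end
end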
Results and Benefits
After implementing this system, I’ve seen:
- 95% accuracy in extracting structured data
- Zero maintenance needed when sites update their HTML
- Ability to handle variations in how data is presented
- Cost savings from running the LLM locally
Setting Up Your Own System
- Download LM Studio from their website
- Download a suitable model (I use llama-3.2-3b-instruct)
- Start the local server
- Configure your scraper to use `http://127.0.0.1:1234/v1/chat/completions`
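Before pointing the scraper at it, it's worth a quick sanity check that the server is up. LM Studio exposes an OpenAI-compatible endpoint for listing models, so something like this should print whatever you have loaded:
require 'net/http'
require 'json'

# Ask the local server which models are available
uri = URI('http://127.0.0.1:1234/v1/models')
response = Net::HTTP.get_response(uri)
model_ids = JSON.parse(response.body)['data'].map { |m| m['id'] }
puts "Loaded models: #{model_ids.join(', ')}"
If that prints your model's name, the requests to /v1/chat/completions should work as well.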
Conclusion
By combining traditional web scraping with local LLMs, we can build more resilient and intelligent data extraction systems. This approach not only reduces maintenance overhead but also improves the quality of extracted data while keeping costs low and data private.
The complete code is available in my GitHub repository, and you can adapt it for your own use cases.
If you have any questions or need help implementing this approach, feel free to reach out to me at blakelinkd@gmail.com.