Building Smart Web Scrapers with Local LLMs: Using AI to Create Resilient Data Extraction Tools
A guide to leveraging locally-run Large Language Models (LLMs) to build web scrapers that can adapt to changing website structures and extract data more reliably than traditional CSS selector-based approaches.
When I first started building web scrapers, I kept running into the same problem: brittle CSS selectors. You spend hours crafting the perfect selectors, and then the website updates their class names or restructures their HTML. Suddenly, your scraper breaks, and you’re back to square one. That’s why I started experimenting with locally running LLMs through LM Studio to make my scrapers more resilient.
The Traditional Problem
Traditional web scrapers typically look something like this:
def scrape_job(doc)
  {
    title: doc.css('.job-title').text,
    company: doc.css('.company-name').text,
    location: doc.css('.location').text
  }
end
This works great… until it doesn’t. One website update and everything breaks.
A Better Way: Using Local LLMs
Instead of relying on specific HTML structures, we can use LLMs to understand the content semantically. Here’s how I do it:
class JobDetailsExtractor
  LM_STUDIO_URL = 'http://127.0.0.1:1234/v1/chat/completions'

  def process_page(url)
    # Load the page, then clean the HTML
    @driver.get(url)
    doc = Nokogiri::HTML(@driver.page_source)
    doc.css('script, style').remove

    # Send the stripped-down body to the local LLM
    response = make_lm_studio_request(doc.at('body').to_html)

    # Parse the structured response
    parse_response(response)
  end
end
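The snippet above leans on a few things it doesn't show: Nokogiri for parsing and a Selenium-driven browser stored in @driver. Here's a minimal setup sketch; headless Chrome is my assumption, and any Selenium-supported browser would work just as well:

require 'nokogiri'
require 'selenium-webdriver'
require 'net/http'
require 'json'
require 'uri'

class JobDetailsExtractor
  def initialize
    # Assumption: a headless Chrome session; swap in another browser if you prefer
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless=new')
    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end
end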
Structured Data Extraction with JSON Schema
The real magic happens in how we format our LLM request. Here’s the key part:
def make_lm_studio_request(content, missing_fields = [])
  uri = URI(LM_STUDIO_URL)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')

  system_prompt = "You are a job posting analyzer. Extract detailed information..."
  if missing_fields.any?
    system_prompt += " Additionally, please specifically look for these missing fields: #{missing_fields.join(', ')}."
  end

  request.body = {
    model: "llama-3.2-3b-instruct",
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: content }
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "job_details",
        strict: true,
        schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            company: { type: "string" },
            location: { type: "string" },
            salary_range: { type: "string" },
            employment_type: { type: "string" },
            description: { type: "string" },
            requirements: { type: "string" },
            benefits: { type: "string" },
            government_job: { type: "boolean" },
            urgent: { type: "boolean" }
          },
          required: ["title", "description", "requirements"]
        }
      }
    }
  }.to_json

  # Send the request and return the raw HTTP response
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end
By using the `json_schema` parameter, we get:
- Consistent Output: Every response follows our defined structure
- Type Safety: Fields come back in the correct format
- Required Fields: We can specify which fields must be present
- No Post-Processing: The data is ready to use
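One piece referenced in `process_page` but not shown is `parse_response`. Here's a minimal sketch, assuming LM Studio's OpenAI-compatible response shape, where the structured JSON comes back as a string in `choices[0].message.content`:

require 'json'

# Minimal sketch: pull the JSON document out of an OpenAI-style chat
# completion response. Keys stay as strings ('title', 'company', ...).
def parse_response(response)
  body = JSON.parse(response.body)
  JSON.parse(body.dig('choices', 0, 'message', 'content'))
rescue JSON::ParserError, TypeError => e
  warn "Failed to parse LLM response: #{e.message}"
  {}
end

With `strict` schema enforcement the content string should already match the schema, so a plain JSON.parse is usually all you need.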
Making it Production Ready
Rate Limiting and Ethical Scraping
When building web scrapers, being a good web citizen is crucial. With LLM-powered scrapers, we have a unique advantage: the inherent processing time of the language model can naturally introduce beneficial rate limiting.
class WebScraper
  # Configurable rate limiting parameters
  MIN_REQUEST_INTERVAL = 3  # Minimum seconds between requests
  MAX_REQUEST_INTERVAL = 5  # Maximum seconds between requests
  LLM_PROCESSING_BUFFER = 2 # Minimum pause even when the LLM has already taken a while

  def initialize
    @last_web_request_time = nil
    @request_count = 0
    @max_requests_per_session = 50 # Prevent excessive scraping
  end

  def process_links(jobs)
    jobs.each do |job|
      # Stop once the total request limit is reached
      break if @request_count >= @max_requests_per_session

      # Enforce a minimum gap since the previous web request
      enforce_rate_limit

      # Record when this request starts so the LLM processing time
      # counts toward the interval before the next one
      @last_web_request_time = Time.now
      @request_count += 1

      # Fetch and process the job with the LLM
      process_job(job)
      log_scraping_activity(job)
    end
  end

  private

  def enforce_rate_limit
    return unless @last_web_request_time

    elapsed = Time.now - @last_web_request_time
    # Sleep the remainder of a randomized interval, but always pause at
    # least LLM_PROCESSING_BUFFER seconds between requests
    target_interval = rand(MIN_REQUEST_INTERVAL..MAX_REQUEST_INTERVAL)
    sleep_time = [target_interval - elapsed, LLM_PROCESSING_BUFFER].max
    sleep(sleep_time) if sleep_time > 0
  end

  def log_scraping_activity(job)
    # Optional: Log scraping activities for monitoring
    File.open('scraping_log.txt', 'a') do |file|
      file.puts "#{Time.now}: Processed job - #{job['title']}"
    end
  end
end
Rate Limiting Strategies
The beauty of LLM-powered web scraping is that the model’s processing time naturally introduces a delay between requests. This has several benefits:
- Reduced Server Load: By spacing out requests, we minimize the impact on target websites.
- Avoiding IP Blocks: Gradual, human-like request patterns make your traffic less likely to be flagged as automated.
Pro Tip: The LLM’s processing time (typically 2-5 seconds) acts as a built-in rate limiter, making your scraper more respectful and less likely to be blocked.
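If you want to see what that processing time looks like on your own hardware, here's a quick timing sketch using Benchmark from Ruby's standard library; `page_html` stands in for whatever HTML you're extracting from:

require 'benchmark'

# Time a single extraction call; this delay already spaces out web requests.
llm_seconds = Benchmark.realtime do
  make_lm_studio_request(page_html)
end
puts format('LLM processing took %.1fs', llm_seconds)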
Ethical Considerations
- Always check a website’s robots.txt before scraping it (see the sketch after this list)
- Provide a user agent that identifies your scraper
- Avoid scraping sites that explicitly prohibit it
- Consider reaching out to website owners for permission or API access
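To make the first two points concrete, here's a rough sketch using only the standard library plus the Selenium options from earlier. The bot name and contact address are placeholders, and a real robots.txt check should parse per-agent rules rather than just looking for a blanket disallow:

require 'net/http'
require 'uri'
require 'selenium-webdriver'

# Very rough robots.txt check: refuses to scrape if a blanket "Disallow: /"
# appears. A proper implementation should honor per-agent rules.
def allowed_to_scrape?(base_url)
  robots = URI.join(base_url, '/robots.txt')
  response = Net::HTTP.get_response(robots)
  return true unless response.is_a?(Net::HTTPSuccess)

  !response.body.match?(/^\s*Disallow:\s*\/\s*$/i)
end

# Identify the scraper in the browser session
# (agent string and contact address are placeholders).
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--user-agent=JobScraperBot/1.0 (+mailto:you@example.com)')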
Robust Error Handling and Text Cleaning
Handling messy, inconsistent text is a critical challenge in web scraping. Our text cleaning approach needs to be both robust and flexible:
require 'json'

class TextCleaner
  def self.clean(text, options = {})
    # Coerce to string and bail out early on empty input
    text = text.to_s
    return "" if text.strip.empty?

    # Ensure the input is UTF-8 and repair encoding issues
    cleaned_text = text.encode(
      'UTF-8',
      invalid: :replace, # Handle invalid byte sequences
      undef: :replace,   # Replace undefined characters
      replace: ''        # Silently drop problematic characters
    ).scrub('')          # Remove any bytes that are still not valid UTF-8

    # Normalize whitespace, then strip control characters
    cleaned_text.tap do |t|
      # Replace commas, line breaks, and tabs with spaces first so words
      # separated by newlines don't get glued together
      t.gsub!(/[,\r\n\t]+/, ' ')
      # Strip any remaining non-printable and control characters
      t.gsub!(/[\u0000-\u001F\u007F\u2028\u2029]/, '')
      # Collapse repeated whitespace
      t.gsub!(/\s+/, ' ')
    end

    # Optional HTML tag removal (done before truncation so tags don't
    # count toward the length limit)
    if options[:strip_html]
      cleaned_text.gsub!(/<[^>]*>/, '')
    end

    # Optional length limit
    if options[:max_length]
      cleaned_text = cleaned_text[0, options[:max_length]]
    end

    cleaned_text.strip
  rescue StandardError => e
    # Log and handle any unexpected errors
    log_cleaning_error(text, e)
    "" # Fail-safe empty string
  end

  def self.log_cleaning_error(original_text, error)
    # Log detailed error information
    error_log = {
      original_text: original_text,
      error_class: error.class.name,
      error_message: error.message,
      timestamp: Time.now
    }

    # Write to the error log file
    File.open('text_cleaning_errors.log', 'a') do |file|
      file.puts JSON.pretty_generate(error_log)
    end
  end
  private_class_method :log_cleaning_error
end
# Practical usage in web scraping
class JobScraper
  def extract_job_details(raw_html)
    # Clean and process the job description
    # (the extract_* helpers are defined elsewhere in the scraper)
    description = TextCleaner.clean(
      extract_description_from_html(raw_html),
      max_length: 2000, # Limit to 2000 characters
      strip_html: true  # Remove any HTML tags
    )

    # Every field goes through TextCleaner for consistent output
    {
      title: TextCleaner.clean(extract_title(raw_html)),
      company: TextCleaner.clean(extract_company(raw_html)),
      description: description
    }
  end
end
Robust Text Cleaning for Web Scraping
Effective web scraping requires sophisticated text processing. Our TextCleaner provides a comprehensive solution for handling complex text extraction challenges:
Key Features:
- Advanced Encoding Management
  - Converts text to UTF-8 with intelligent error handling
  - Safely manages diverse character sets and potential encoding issues
- Intelligent Character Processing
  - Removes non-printable and control characters
  - Normalizes whitespace and line breaks
  - Ensures clean, consistent text output
- Flexible Cleaning Options
  - Configurable text length limits
  - Optional HTML tag stripping
  - Customizable cleaning parameters
- Comprehensive Error Handling
  - Prevents processing failures
  - Detailed error logging
  - Fail-safe mechanisms to ensure data continuity
Practical Benefits
- Increased data reliability
- Consistent text formatting
- Reduced risk of processing errors
- Detailed debugging capabilities through comprehensive logging
By implementing these advanced text cleaning techniques, we create a more resilient and adaptable web scraping solution that can handle the complexities of real-world data extraction.
Handling Missing Fields
Sometimes job posts don’t include all the information we want:
def check_missing_fields(job)
  missing = []
  ['company', 'location'].each do |field|
    missing << field if job[field].nil? || job[field].strip.empty?
  end
  missing
end
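Here's a sketch of how that list can feed a follow-up extraction pass. It assumes the `make_lm_studio_request(content, missing_fields)` signature shown earlier, which folds the missing field names into the system prompt, and the string-keyed output of `parse_response`:

# Re-run extraction with a prompt that highlights the missing fields,
# then merge any newly found values into the original record.
def fill_missing_fields(job, page_html)
  missing = check_missing_fields(job)
  return job if missing.empty?

  response = make_lm_studio_request(page_html, missing)
  retry_data = parse_response(response)

  missing.each do |field|
    value = retry_data[field]
    job[field] = value if value && !value.strip.empty?
  end
  job
end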
Data Storage
I use CSV files for storage, but you could easily adapt this for a database:
require 'csv'
require 'fileutils'

def setup_csv
  FileUtils.mkdir_p('data')
  CSV.open(DETAILS_CSV_PATH, 'w') do |csv|
    csv << [
      'id',
      'title',
      'company',
      'location',
      'salary_range',
      'employment_type',
      'description',
      'requirements',
      'benefits',
      'processed_date'
    ]
  end
end
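Appending each processed job is then one row per record. A sketch, assuming the string keys returned by `parse_response` and the same column order as the header above:

require 'csv'
require 'securerandom'

# Append one extracted job to the CSV, matching the header order above.
def save_job(job)
  CSV.open(DETAILS_CSV_PATH, 'a') do |csv|
    csv << [
      SecureRandom.uuid,            # 'id'; your scraper may use a site-specific id instead
      job['title'],
      job['company'],
      job['location'],
      job['salary_range'],
      job['employment_type'],
      job['description'],
      job['requirements'],
      job['benefits'],
      Time.now.strftime('%Y-%m-%d') # 'processed_date'
    ]
  end
end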
Results and Benefits
After implementing this system, I’ve seen:
- 95% accuracy in extracting structured data
- Zero maintenance needed when sites update their HTML
- Ability to handle variations in how data is presented
- Cost savings from running the LLM locally
Setting Up Your Own System
- Download LM Studio from their website
- Download a suitable model (I use llama-3.2-3b-instruct)
- Start the local server
- Configure your scraper to use `http://127.0.0.1:1234/v1/chat/completions`
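Before wiring in the scraper, it's worth confirming the server responds. Here's a minimal smoke test against that endpoint, using the same model name as earlier; the prompt itself is arbitrary:

require 'net/http'
require 'json'
require 'uri'

uri = URI('http://127.0.0.1:1234/v1/chat/completions')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = {
  model: 'llama-3.2-3b-instruct',
  messages: [{ role: 'user', content: 'Reply with the single word: ready' }]
}.to_json

# A successful reply confirms LM Studio is serving the model locally
response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body).dig('choices', 0, 'message', 'content')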
Conclusion
By combining traditional web scraping with local LLMs, we can build more resilient and intelligent data extraction systems. This approach not only reduces maintenance overhead but also improves the quality of extracted data while keeping costs low and data private.
The complete code is available in my GitHub repository, and you can adapt it for your own use cases.
If you have any questions or need help implementing this approach, feel free to reach out to me at blakelinkd@gmail.com.