Building Smart Web Scrapers with Local LLMs: Using AI to Create Resilient Data Extraction Tools
A guide to leveraging locally-run Large Language Models (LLMs) to build web scrapers that can adapt to changing website structures and extract data more reliably than traditional CSS selector-based approaches.
When I first started building web scrapers, I kept running into the same problem: brittle CSS selectors. You spend hours crafting the perfect selectors, and then the website updates its class names or restructures its HTML. Suddenly, your scraper breaks, and you’re back to square one. That’s why I started experimenting with running LLMs locally through LM Studio to make my scrapers more resilient.
Cleaning HTML: Preparing Content for Semantic Extraction
When scraping web pages, the raw HTML is often a messy cocktail of scripts, styles, and unnecessary markup. Before sending content to our local LLM, we need to strip away the noise and focus on the meaningful text. Here’s how I approach HTML cleaning:
def clean_html_for_llm(doc)
  # Remove script, style, and stylesheet tags that bloat our context
  doc.css('script, style, link[rel="stylesheet"]').remove

  # Remove common non-content elements
  doc.css('header, footer, nav, aside, .sidebar, .advertisement').remove

  # Return just the cleaned body markup
  doc.at('body').to_html
end
def prepare_content_for_extraction(url)
  # Use Selenium to render JavaScript-heavy pages
  @driver.get(url)

  # Parse the rendered page
  doc = Nokogiri::HTML(@driver.page_source)

  # Clean the HTML
  cleaned_content = clean_html_for_llm(doc)

  # Optionally, truncate very long content to prevent context overflow
  cleaned_content[0..15000]
end
Why This Matters
Traditional web scraping often fails because websites are complex, dynamic beasts. By carefully cleaning our HTML, we:
- Reduce Noise: Remove scripts, styles, and navigation elements that distract our LLM
- Improve Context Quality: Focus on the core content
- Prevent Context Overflow: Ensure we don’t exceed token limits
- Enhance Extraction Accuracy: Give the LLM a cleaner signal to work with
Practical Example
Let’s break down what’s happening:
- `script` and `style` tags are pure noise for content extraction
- Navigation, headers, and footers rarely contain job details
- We use Nokogiri to parse and manipulate the HTML surgically
- Selenium ensures we capture dynamically loaded content
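The snippets above assume an already-initialized `@driver`. A minimal setup with the selenium-webdriver gem and headless Chrome looks something like this (the exact Chrome options are my own choice, so adjust them for your environment):
require 'selenium-webdriver'
require 'nokogiri'

# Headless Chrome: render JavaScript without opening a browser window
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
options.add_argument('--disable-gpu')

@driver = Selenium::WebDriver.for(:chrome, options: options)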
Pro Tips
- Always set a reasonable content length limit
- Use CSS selectors to remove complex, nested elements
- Consider website-specific cleaning rules for tricky job boards
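For that last tip, here's a rough sketch of how per-site rules could sit on top of `clean_html_for_llm` (the host names and selectors below are placeholders, not real job boards):
require 'uri'

# Hypothetical site-specific cleanup rules, keyed by host name
SITE_SPECIFIC_SELECTORS = {
  'jobs.example.com'    => '.related-jobs, .newsletter-signup',
  'careers.example.org' => '#cookie-banner, .share-widget'
}.freeze

def apply_site_specific_cleaning(doc, url)
  selectors = SITE_SPECIFIC_SELECTORS[URI(url).host]
  doc.css(selectors).remove if selectors
  doc
end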
By implementing smart HTML cleaning, we transform brittle web scraping into a robust, AI-powered data extraction system.
The Traditional Problem
Traditional web scrapers typically look something like this:
def scrape_job(doc)
  {
    title: doc.css('.job-title').text,
    company: doc.css('.company-name').text,
    location: doc.css('.location').text
  }
end
This works great… until it doesn’t. One website update and everything breaks.
A Better Way: Using Local LLMs
Instead of relying on specific HTML structures, we can use LLMs to understand the content semantically. Here’s how I do it:
class JobDetailsExtractor
  LM_STUDIO_URL = 'http://127.0.0.1:1234/v1/chat/completions'

  def process_page(url)
    # Load the page so dynamically rendered content is present
    @driver.get(url)

    # First, clean the HTML
    doc = Nokogiri::HTML(@driver.page_source)
    doc.css('script, style').remove

    # Send to local LLM
    response = make_lm_studio_request(doc.at('body').to_html)

    # Parse structured response
    parse_response(response)
  end
end
Structured Data Extraction with JSON Schema
The real magic happens in how we format our LLM request. Here’s the key part:
# Uses net/http, uri, and json from the Ruby standard library
def make_lm_studio_request(content, missing_fields = [])
  uri = URI(LM_STUDIO_URL)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')

  system_prompt = "You are a job posting analyzer. Extract detailed information..."
  if missing_fields.any?
    system_prompt += " Additionally, please specifically look for these missing fields: #{missing_fields.join(', ')}."
  end

  request.body = {
    model: "llama-3.2-3b-instruct",
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: content }
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "job_details",
        strict: "true",
        schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            company: { type: "string" },
            location: { type: "string" },
            salary_range: { type: "string" },
            employment_type: { type: "string" },
            description: { type: "string" },
            requirements: { type: "string" },
            benefits: { type: "string" },
            government_job: { type: "boolean" },
            urgent: { type: "boolean" }
          },
          required: ["title", "description", "requirements"]
        }
      }
    }
  }.to_json

  # Local models can be slow, so give the request a generous read timeout
  Net::HTTP.start(uri.hostname, uri.port, read_timeout: 120) { |http| http.request(request) }
end
By using the `json_schema` parameter, we get:
- Consistent Output: Every response follows our defined structure
- Type Safety: Fields come back in the correct format
- Required Fields: We can specify which fields must be present
- No Post-Processing: The data is ready to use
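I haven't shown `parse_response` above, so here's a minimal sketch, assuming the OpenAI-compatible response shape LM Studio returns (the model's JSON lives in `choices[0].message.content`):
# Pull the structured JSON out of the chat completion response
def parse_response(response)
  body = JSON.parse(response.body)
  raw  = body.dig('choices', 0, 'message', 'content')
  JSON.parse(raw)
rescue JSON::ParserError => e
  warn "Failed to parse LLM response: #{e.message}"
  {}
end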
Making it Production Ready
Rate Limiting
To be a good web citizen, we need rate limiting:
MIN_REQUEST_INTERVAL = 3 # Minimum seconds between requests
MAX_REQUEST_INTERVAL = 5 # Maximum seconds between requests
def process_links
  jobs.each do |job|
    if @last_web_request_time
      elapsed = Time.now - @last_web_request_time
      # Wait a random interval between the min and max so requests aren't perfectly periodic
      wait = rand(MIN_REQUEST_INTERVAL..MAX_REQUEST_INTERVAL)
      sleep(wait - elapsed) if elapsed < wait
    end
    process_job(job)
    @last_web_request_time = Time.now
  end
end
Error Handling and Text Cleaning
We need robust error handling for both web requests and LLM processing:
def clean_text(text)
  return "" if text.nil?

  text
    .to_s
    .encode('UTF-8', invalid: :replace, undef: :replace, replace: '') # drop invalid bytes
    .gsub(/[\u0000-\u001F\u007F\u2028\u2029]/, '')                    # strip control characters
    .gsub(/[,\r\n\t]+/, ' ')                                          # flatten commas, newlines, tabs for CSV
    .gsub(/\s+/, ' ')                                                 # collapse runs of whitespace
    .strip
end
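`clean_text` covers the text side. For the request side, a simple retry wrapper is usually enough; here's a sketch (the method name, retry count, and backoff are my own choices):
MAX_RETRIES = 3

# Retry transient web and LLM failures with a short backoff
def with_retries(label)
  attempts = 0
  begin
    yield
  rescue Net::ReadTimeout, Errno::ECONNREFUSED, Selenium::WebDriver::Error::WebDriverError => e
    attempts += 1
    raise if attempts >= MAX_RETRIES
    warn "#{label} failed (#{e.class}), retrying in #{attempts * 5}s..."
    sleep(attempts * 5)
    retry
  end
end
Each call then becomes something like `with_retries('job page') { process_page(url) }`.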
Handling Missing Fields
Sometimes job posts don’t include all the information we want:
def check_missing_fields(job)
  missing = []
  ['company', 'location'].each do |field|
    missing << field if job[field].nil? || job[field].to_s.strip.empty?
  end
  # The names returned here get appended to the system prompt on a retry
  # (see make_lm_studio_request above): "Additionally, please specifically
  # look for these missing fields: ..."
  missing
end
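Tying it together, a second pass might look roughly like this (`cleaned_content` and the merge strategy are assumptions on my part):
# Hypothetical retry: ask the LLM again, calling out the fields it missed
missing = check_missing_fields(job)
if missing.any?
  retry_response = make_lm_studio_request(cleaned_content, missing)
  # Only fill in fields that are still empty
  job.merge!(parse_response(retry_response)) { |_field, old, new| old.to_s.strip.empty? ? new : old }
end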
Data Storage
I use CSV files for storage, but you could easily adapt this for a database:
def setup_csv
  FileUtils.mkdir_p('data')

  CSV.open(DETAILS_CSV_PATH, 'w') do |csv|
    csv << [
      'id',
      'title',
      'company',
      'location',
      'salary_range',
      'employment_type',
      'description',
      'requirements',
      'benefits',
      'processed_date'
    ]
  end
end
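Appending each processed job is then one row per page. Here's a sketch that assumes the job hash uses the same string keys as the header row (the UUID for the id column is my own choice):
require 'csv'
require 'securerandom'

# Append one processed job to the details CSV
def save_job(job)
  CSV.open(DETAILS_CSV_PATH, 'a') do |csv|
    csv << [
      SecureRandom.uuid,
      clean_text(job['title']),
      clean_text(job['company']),
      clean_text(job['location']),
      clean_text(job['salary_range']),
      clean_text(job['employment_type']),
      clean_text(job['description']),
      clean_text(job['requirements']),
      clean_text(job['benefits']),
      Time.now.strftime('%Y-%m-%d')
    ]
  end
end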
Results and Benefits
After implementing this system, I’ve seen:
- 95% accuracy in extracting structured data
- Zero maintenance needed when sites update their HTML
- Ability to handle variations in how data is presented
- Cost savings from running the LLM locally
Setting Up Your Own System
- Download LM Studio from their website
- Download a suitable model (I use llama-3.2-3b-instruct)
- Start the local server
- Configure your scraper to use `http://127.0.0.1:1234/v1/chat/completions`
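Before pointing the scraper at it, it's worth a quick sanity check that the server is up. LM Studio exposes an OpenAI-compatible endpoint for listing models, so something like this should print whatever you have loaded:
require 'net/http'
require 'json'

# Ask the local server which models are available
uri = URI('http://127.0.0.1:1234/v1/models')
response = Net::HTTP.get_response(uri)
model_ids = JSON.parse(response.body)['data'].map { |m| m['id'] }
puts "Loaded models: #{model_ids.join(', ')}"
If that prints your model's name, the requests to /v1/chat/completions should work as well.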
Conclusion
By combining traditional web scraping with local LLMs, we can build more resilient and intelligent data extraction systems. This approach not only reduces maintenance overhead but also improves the quality of extracted data while keeping costs low and data private.
The complete code is available in my GitHub repository, and you can adapt it for your own use cases.
If you have any questions or need help implementing this approach, feel free to reach out to me at blakelinkd@gmail.com.