Building Smart Web Scrapers with Local LLMs: Using AI to Create Resilient Data Extraction Tools

A guide to leveraging locally-run Large Language Models (LLMs) to build web scrapers that can adapt to changing website structures and extract data more reliably than traditional CSS selector-based approaches.

When I first started building web scrapers, I kept running into the same problem: brittle CSS selectors. You spend hours crafting the perfect selectors, and then the website updates their class names or restructures their HTML. Suddenly, your scraper breaks, and you’re back to square one. That’s why I started experimenting with locally running LLMs through LM Studio to make my scrapers more resilient.

The Traditional Problem

Traditional web scrapers typically look something like this:

def scrape_job(doc)
  {
    title: doc.css('.job-title').text,
    company: doc.css('.company-name').text,
    location: doc.css('.location').text
  }
end

This works great… until it doesn’t. One website update and everything breaks.

A Better Way: Using Local LLMs

Instead of relying on specific HTML structures, we can use LLMs to understand the content semantically. Here’s how I do it:

class JobDetailsExtractor
  LM_STUDIO_URL = 'http://127.0.0.1:1234/v1/chat/completions'

  def process_page(url)
    # Load the page with Selenium (@driver is a Selenium WebDriver session
    # created in the initializer, not shown), then strip the noisy tags
    @driver.navigate.to(url)
    doc = Nokogiri::HTML(@driver.page_source)
    doc.css('script, style').remove

    # Send the cleaned <body> HTML to the local LLM
    response = make_lm_studio_request(doc.at('body').to_html)

    # Parse the structured response
    parse_response(response)
  end
end
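
Two helpers do the heavy lifting here. `make_lm_studio_request` is shown in the next section; `parse_response` can be as small as the sketch below, assuming the request helper returns the raw Net::HTTP response from LM Studio's OpenAI-compatible endpoint, where the structured JSON arrives as a string in `choices[0].message.content`:

require 'json'

def parse_response(response)
  body = JSON.parse(response.body)
  # With a JSON-schema response_format, the model's structured output lands here as a string
  JSON.parse(body.dig('choices', 0, 'message', 'content'))
rescue JSON::ParserError, TypeError
  {}  # fail soft if the server or model returns malformed or missing content
end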

Structured Data Extraction with JSON Schema

The real magic happens in how we format our LLM request. Here’s the key part:

def make_lm_studio_request(content, missing_fields = [])
  uri = URI(LM_STUDIO_URL)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')

  # Ask again for any fields the first pass missed (see "Handling Missing Fields" below)
  system_prompt = "You are a job posting analyzer. Extract detailed information..."
  if missing_fields.any?
    system_prompt += " Additionally, please specifically look for these missing fields: #{missing_fields.join(', ')}."
  end

  request.body = {
    model: "llama-3.2-3b-instruct",
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: content }
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "job_details",
        strict: true,
        schema: {
          type: "object",
          properties: {
            title: { type: "string" },
            company: { type: "string" },
            location: { type: "string" },
            salary_range: { type: "string" },
            employment_type: { type: "string" },
            description: { type: "string" },
            requirements: { type: "string" },
            benefits: { type: "string" },
            government_job: { type: "boolean" },
            urgent: { type: "boolean" }
          },
          required: ["title", "description", "requirements"]
        }
      }
    }
  }.to_json

  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end

By using the `json_schema` parameter, we get:

  1. Consistent Output: Every response follows our defined structure
  2. Type Safety: Fields come back in the correct format
  3. Required Fields: We can specify which fields must be present
  4. No Post-Processing: The data is ready to use, as the short example after this list shows
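
In practice that means the parsed hash can be used directly, with no type coercion or regex cleanup. A small illustration (`notify_urgent_listing` is a hypothetical downstream handler, here just to show the boolean field being used as a real boolean):

job = parse_response(response)
puts "#{job['title']} (#{job['employment_type']})"
notify_urgent_listing(job) if job['urgent']  # true/false, not "yes"/"no" strings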

Making It Production-Ready

Rate Limiting and Ethical Scraping

When building web scrapers, being a good web citizen is crucial. LLM-powered scrapers have a built-in advantage here: the time the model spends processing each page naturally spaces out your requests.

class WebScraper
  # Configurable rate limiting parameters
  MIN_REQUEST_INTERVAL = 3   # Minimum seconds between requests
  MAX_REQUEST_INTERVAL = 5   # Maximum seconds between requests
  LLM_PROCESSING_BUFFER = 2  # Additional buffer for LLM processing time

  def initialize
    @last_web_request_time = nil
    @request_count = 0
    @max_requests_per_session = 50  # Prevent excessive scraping
  end

  def process_links(jobs)
    jobs.each do |job|
      # Check total request limit
      break if @request_count >= @max_requests_per_session

      # Enforce minimum time between requests
      enforce_rate_limit

      # Process the job with the LLM
      process_job(job)

      # Track request metrics and log the activity
      @request_count += 1
      @last_web_request_time = Time.now
      log_scraping_activity(job)
    end
  end

  private

  def enforce_rate_limit
    return unless @last_web_request_time

    elapsed = Time.now - @last_web_request_time

    # Pick a random target interval so requests don't land on a fixed beat
    target_interval = rand(MIN_REQUEST_INTERVAL..MAX_REQUEST_INTERVAL)

    # Factor in LLM processing time as a natural delay: always pause at least
    # the buffer, plus whatever is left of the target interval
    sleep_time = [target_interval - elapsed, LLM_PROCESSING_BUFFER].max

    sleep(sleep_time) if sleep_time > 0
  end

  def log_scraping_activity(job)
    # Log scraping activity for monitoring
    File.open('scraping_log.txt', 'a') do |file|
      file.puts "#{Time.now}: Processed job - #{job['title']}"
    end
  end
end
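
Driving it is then a couple of lines; the `jobs` array here is assumed to come from an earlier pass over the listing or search-results page:

scraper = WebScraper.new
scraper.process_links(jobs)  # each entry is a hash collected from the listing page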

Rate Limiting Strategies

The beauty of LLM-powered web scraping is that the model’s processing time naturally introduces a delay between requests. This has two main benefits:

  1. Reduced Server Load: By spacing out requests, we minimize the impact on target websites.
  2. Fewer IP Blocks: Gradual, human-like request patterns make it less likely the scraper gets flagged as a bot.

Pro Tip: The LLM’s processing time (typically 2-5 seconds) acts as a built-in rate limiter, making your scraper more respectful and less likely to be blocked.
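
If you want to be exact about it, you can measure the LLM round trip and only sleep for whatever is left of the interval. A small sketch (not from the original scraper) reusing `process_job` and `MIN_REQUEST_INTERVAL` from the class above:

def process_job_with_timing(job)
  started = Time.now
  process_job(job)                    # includes the LLM round trip
  llm_time = Time.now - started

  remaining = MIN_REQUEST_INTERVAL - llm_time
  sleep(remaining) if remaining > 0   # the model replied quickly, so top up the delay
end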

Ethical Considerations

Rate limiting is only part of being a good web citizen. Before pointing the scraper at a site, check its robots.txt (a naive check is sketched below) and terms of service, identify your client honestly, and collect only the data you actually need. Running the LLM locally also helps on the privacy side: the scraped content never leaves your machine.
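
A very rough robots.txt check using only the standard library; it reads just the wildcard (`User-agent: *`) rules and errs on the side of allowing the page when robots.txt can’t be fetched, so treat it as a starting point rather than a full parser:

require 'net/http'
require 'uri'

def allowed_by_robots?(page_url)
  uri = URI(page_url)
  robots = Net::HTTP.get(URI("#{uri.scheme}://#{uri.host}/robots.txt"))

  disallowed = []
  applies = false
  robots.each_line do |line|
    rule = line.split('#').first.to_s.strip
    if rule =~ /\AUser-agent:\s*(.+)\z/i
      applies = (Regexp.last_match(1).strip == '*')
    elsif applies && rule =~ /\ADisallow:\s*(.*)\z/i
      disallowed << Regexp.last_match(1).strip
    end
  end

  disallowed.reject(&:empty?).none? { |path| uri.path.start_with?(path) }
rescue StandardError
  true  # robots.txt unreachable; don't silently block the whole run
end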

Robust Error Handling and Text Cleaning

Handling messy, inconsistent text is a critical challenge in web scraping. Our text cleaning approach needs to be both robust and flexible:

require 'json'

class TextCleaner
  def self.clean(text, options = {})
    # Early return for nil or empty input
    return "" if text.nil? || text.strip.empty?

    # Ensure input is a string and normalize its encoding
    cleaned_text = text.to_s.encode(
      'UTF-8', 
      invalid: :replace,   # Handle encoding issues
      undef: :replace,     # Replace undefined characters
      replace: ''          # Silently remove problematic characters
    )

    # Remove control characters and normalize whitespace
    cleaned_text.tap do |t|
      # Strip non-printable and control characters
      t.gsub!(/[\u0000-\u001F\u007F\u2028\u2029]/, '')
      
      # Replace commas, line breaks, and tabs with spaces, then collapse runs of whitespace
      t.gsub!(/[,\r\n\t]+/, ' ')
      t.gsub!(/\s+/, ' ')
    end

    # Optional length limit
    if options[:max_length]
      cleaned_text = cleaned_text[0, options[:max_length]]
    end

    # Optional HTML tag removal
    if options[:strip_html]
      cleaned_text.gsub!(/<[^>]*>/, '')
    end

    cleaned_text.strip
  rescue StandardError => e
    # Log and handle any unexpected errors
    log_cleaning_error(text, e)
    ""  # Fail-safe empty string
  end

  def self.log_cleaning_error(original_text, error)
    # Log detailed error information
    error_log = {
      original_text: original_text,
      error_class: error.class.name,
      error_message: error.message,
      timestamp: Time.now
    }

    # Write to error log file
    File.open('text_cleaning_errors.log', 'a') do |file|
      file.puts JSON.pretty_generate(error_log)
    end
  end
  private_class_method :log_cleaning_error
end

# Practical usage in web scraping
class JobScraper
  def extract_job_details(raw_html)
    # Clean and process the job description
    # (extract_description_from_html and friends are site-specific helpers, not shown here)
    description = TextCleaner.clean(
      extract_description_from_html(raw_html),
      max_length: 2000,  # Limit to 2000 characters
      strip_html: true   # Remove any HTML tags
    )

    # Clean every field before it goes anywhere near storage
    {
      title: TextCleaner.clean(extract_title(raw_html)),
      company: TextCleaner.clean(extract_company(raw_html)),
      description: description
    }
  end
end

Robust Text Cleaning for Web Scraping

Scraped text is rarely clean, so the TextCleaner above focuses on a few specific jobs:

Key Features:

  1. Encoding Management

    • Converts text to UTF-8, replacing invalid or undefined characters
    • Keeps unusual character sets from crashing the pipeline
  2. Character Processing

    • Strips non-printable and control characters
    • Collapses whitespace and line breaks into single spaces
  3. Flexible Cleaning Options

    • Configurable length limit via `max_length`
    • Optional HTML tag stripping via `strip_html`
  4. Error Handling

    • A method-level rescue keeps one bad string from halting the run
    • Failures are logged with the original text for later inspection
    • Returns an empty string as a fail-safe

Practical Benefits

In practice, this means one malformed posting can’t take down an entire scraping run, and the text handed to the LLM (and later written to the CSV) stays predictable.
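
Here is a quick, illustrative example of the options in action; the expected outputs are worked out by hand from the code above:

messy = "  Senior  Engineer,\r\n\t<b>Remote</b> role  "

TextCleaner.clean(messy)
# => "Senior Engineer <b>Remote</b> role"

TextCleaner.clean(messy, strip_html: true)
# => "Senior Engineer Remote role"

TextCleaner.clean(messy, strip_html: true, max_length: 15)
# => "Senior Enginee"  (the length cap is applied before the final strip)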

Handling Missing Fields

Sometimes job postings don’t include all the information we want, so after the first pass we check what came back and, if needed, ask the LLM to look again (the second pass is sketched after the method):

def check_missing_fields(job)
  missing = []
  ['company', 'location'].each do |field|
    missing << field if job[field].nil? || job[field].strip.empty?
  end
  missing  # make_lm_studio_request folds these names into the system prompt on retry
end
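
A sketch of that second pass, reusing `make_lm_studio_request` and `parse_response` from earlier; `page_html` stands in for the cleaned body HTML from `process_page`, and the variable names here are mine:

missing = check_missing_fields(job)
if missing.any?
  response = make_lm_studio_request(page_html, missing)
  retry_data = parse_response(response)
  job.merge!(retry_data.slice(*missing))  # only fill in the gaps, keep existing fields
end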

Data Storage

I use CSV files for storage, but you could easily adapt this for a database:

require 'csv'
require 'fileutils'

# DETAILS_CSV_PATH is defined elsewhere in the scraper, e.g. 'data/job_details.csv'

def setup_csv
  FileUtils.mkdir_p('data')
  
  CSV.open(DETAILS_CSV_PATH, 'w') do |csv|
    csv << [
      'id',
      'title',
      'company',
      'location',
      'salary_range',
      'employment_type',
      'description',
      'requirements',
      'benefits',
      'processed_date'
    ]
  end
end
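
Appending each processed job to that file is then one small method. A minimal sketch, assuming the job hash uses the string keys returned by the LLM plus an `id` assigned elsewhere by the scraper:

def append_job(job)
  CSV.open(DETAILS_CSV_PATH, 'a') do |csv|
    csv << [
      job['id'],
      job['title'],
      job['company'],
      job['location'],
      job['salary_range'],
      job['employment_type'],
      job['description'],
      job['requirements'],
      job['benefits'],
      Time.now.strftime('%Y-%m-%d')  # processed_date
    ]
  end
end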

Results and Benefits

After implementing this system, I’ve seen:

  1. Far fewer breakages when target sites tweak their markup, since extraction no longer depends on exact class names
  2. Much less time spent maintaining selectors
  3. Consistent, structured output thanks to the JSON schema
  4. No per-request API costs, and scraped content that never leaves my machine

Setting Up Your Own System

  1. Download LM Studio from their website
  2. Download a suitable model (I use llama-3.2-3b-instruct)
  3. Start the local server
  4. Configure your scraper to use `http://127.0.0.1:1234/v1/chat/completions`, then run something like the quick check below to confirm the server responds
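
A minimal smoke test (my own snippet, not part of the scraper itself) that posts a trivial prompt to the local endpoint and prints the model’s reply:

require 'net/http'
require 'uri'
require 'json'

uri = URI('http://127.0.0.1:1234/v1/chat/completions')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = {
  model: 'llama-3.2-3b-instruct',
  messages: [{ role: 'user', content: 'Reply with the single word: ready' }]
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body).dig('choices', 0, 'message', 'content')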

Conclusion

By combining traditional web scraping with local LLMs, we can build more resilient and intelligent data extraction systems. This approach not only reduces maintenance overhead but also improves the quality of extracted data while keeping costs low and data private.

The complete code is available in my GitHub repository, and you can adapt it for your own use cases.

If you have any questions or need help implementing this approach, feel free to reach out to me at blakelinkd@gmail.com.