Finding a job with Python and Selenium Part 2
In Finding a job with Python and Selenium Part 1 we found a job board that was easy to scrape and saved the data to a local file. In this part we’ll learn how to load that data into a database and perform some basic analysis on it.
We need to initialize a SQLite database to store our data. We'll create a table called jobs with the following columns:
- title: The title of the job
- companyname: The name of the company
- location: The location of the job
- date: The date the job was posted
- link: The link to the job post
- description: The description of the job
- hasapplied: Whether we have applied for the job
import sqlite3
import os
# Database file path
db_file = 'jobs.db'
# SQL to create the jobs table
create_jobs_table_sql = '''
CREATE TABLE IF NOT EXISTS jobs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
companyname TEXT NOT NULL,
location TEXT,
date TEXT,
link TEXT,
description TEXT,
hasapplied INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create an index on companyname for faster lookups
CREATE INDEX IF NOT EXISTS idx_companyname ON jobs(companyname);
-- Create an index on hasapplied for faster filtering
CREATE INDEX IF NOT EXISTS idx_hasapplied ON jobs(hasapplied);
'''
def create_database():
    # Check if database file already exists
    db_exists = os.path.exists(db_file)
    # Connect to the database (this will create it if it doesn't exist)
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()
    # Create the jobs table and its indexes
    cursor.executescript(create_jobs_table_sql)
    # Commit the changes and close the connection
    conn.commit()
    conn.close()
    if db_exists:
        print(f"Connected to existing database: {db_file}")
    else:
        print(f"Created new database: {db_file}")
    print("Jobs table initialized successfully.")

if __name__ == "__main__":
    create_database()
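With the table in place we can load the listings we saved in Part 1. The sketch below assumes Part 1 wrote its results to a CSV file named jobs.csv with a header row matching the column names above; if you saved your data in a different format or with different field names, adjust the reader accordingly.
import csv
import sqlite3

db_file = 'jobs.db'
csv_file = 'jobs.csv'  # assumed name and format of the file saved in Part 1

def load_jobs():
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()
    with open(csv_file, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            # Insert one scraped listing; hasapplied defaults to 0
            cursor.execute(
                "INSERT INTO jobs (title, companyname, location, date, link, description) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (row['title'], row['companyname'], row['location'],
                 row['date'], row['link'], row['description'])
            )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_jobs()
Once the rows are in, a quick SELECT COUNT(*) FROM jobs will confirm the import worked, and we can start querying the data.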
Avoiding Ghost Jobs
How I avoid applying to ghost jobs and job spam in general.
What Are Ghost Job Posts?
According to this article, 50% of job listings do not result in a candidate being hired, up from 20% in 2018. This is a huge problem if you’re seeking work the traditional way. These types of listings are often referred to as “ghost jobs”.
How you can avoid ghost jobs
If you’re not a developer or a tech enthusiast, you might not be able to take this approach, but if you’re at least a little familiar with programming concepts you may be able to use this technique.
A little SQL
I’m using PostgreSQL for this example, but the code should work with slight adjustments on other SQL-based databases.
- Create a table to maintain a unique list of companies and their statistics related to our job search.
CREATE TABLE public.unique_companies (
id serial4 NOT NULL,
companyname text NOT NULL,
job_count int4 NOT NULL,
last_updated timestamp DEFAULT CURRENT_TIMESTAMP NULL,
applications_count int4 DEFAULT 0 NULL,
CONSTRAINT unique_companies_companyname_key UNIQUE (companyname),
CONSTRAINT unique_companies_pkey PRIMARY KEY (id)
);
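One way to keep these statistics current is to periodically refresh the counts from the scraped listings. Here's a minimal sketch using Python and psycopg2; it assumes the scraped listings live in a jobs table with a companyname column in the same PostgreSQL database, and the connection details are placeholders you'll need to adjust.
import psycopg2

# Connection details are placeholders -- point this at your own database.
conn = psycopg2.connect("dbname=jobs user=postgres password=secret host=localhost")
cur = conn.cursor()
# Recompute how many listings each company has posted and upsert the totals.
# Assumes a jobs table with a companyname column exists in this database.
cur.execute("""
    INSERT INTO public.unique_companies (companyname, job_count)
    SELECT companyname, COUNT(*)
    FROM jobs
    GROUP BY companyname
    ON CONFLICT (companyname) DO UPDATE
    SET job_count = EXCLUDED.job_count,
        last_updated = CURRENT_TIMESTAMP;
""")
conn.commit()
cur.close()
conn.close()
With job_count and applications_count tracked per company, you can then query for companies whose posting patterns look suspicious before spending any time on an application.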
Finding a job with Python and Selenium
It takes a lot of work to find a job. In this post I’ll show you how you can automate much of the search using Python and Selenium. I’ll show you how to find job boards that are easy to scrape, and we’ll save that data and analyze it so we can find the jobs we are most qualified for and most likely to score an interview for. We’ll also explore some methods of filtering jobs by how compatible they are with our skillset using free AI tools such as Ollama or LM Studio, and we’ll use an LLM to perform keyword extraction to make searching through job posts easier. Finally, we’ll create a nice web application using React and Flask that will allow us to browse and search the jobs from all the job boards we have been scraping.
The Tools
Python
I like Python for doing this type of work because it’s simple and effective. If you don’t know how to install and set up Python, you can check out this link to learn: Installing Python
Selenium
Python has built-in options for extracting content from web pages, but many job boards use JavaScript to render content inside the browser. To capture this content we need a browser instance, and this is where Selenium comes in. Selenium lets us create a real browser session and control our interactions with the page programmatically.
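As a quick illustration (not the full scraper we'll build in this series), here is roughly what driving a page with Selenium looks like. The URL and CSS selector are placeholders, and it assumes Chrome is installed.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()             # assumes Chrome is installed
driver.implicitly_wait(10)              # give JavaScript-rendered content time to appear
driver.get("https://example.com/jobs")  # placeholder URL

# Placeholder selector -- each job board will need its own
for card in driver.find_elements(By.CSS_SELECTOR, ".job-card"):
    print(card.text)

driver.quit()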
SQLite
We have a LOT of data to store, so Excel is not going to cut it. One database table I have holds about a week’s worth of job posts and is already over 100MB. SQLite will also let us run the more advanced queries we’ll need when it comes time to analyze and sort our jobs.
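As a taste of the kind of query this enables, here is a small example against the jobs table we'll build in Part 2; it lists the locations with the most postings we haven't applied to yet.
import sqlite3

conn = sqlite3.connect("jobs.db")
cur = conn.cursor()
# The ten locations with the most postings we haven't applied to yet
cur.execute("""
    SELECT location, COUNT(*) AS openings
    FROM jobs
    WHERE hasapplied = 0
    GROUP BY location
    ORDER BY openings DESC
    LIMIT 10;
""")
for location, openings in cur.fetchall():
    print(f"{location}: {openings}")
conn.close()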