Finding a job with Python and Selenium Part 2

In Finding a job with Python and Selenium Part 1 we found a job board that was easy to scrape and saved the data to a local file. In this part we'll set up a database to hold that data and extend our scraper to capture each job's full description.

First we need to initialize a SQLite database to store our data. We'll create a table called jobs with the following schema:

import sqlite3
import os

# Database file path
db_file = 'jobs.db'

# SQL to create the jobs table
create_jobs_table_sql = '''
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    companyname TEXT NOT NULL,
    location TEXT,
    date TEXT,
    link TEXT,
    description TEXT,
    hasapplied INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create an index on companyname for faster lookups
CREATE INDEX IF NOT EXISTS idx_companyname ON jobs(companyname);

-- Create an index on hasapplied for faster filtering
CREATE INDEX IF NOT EXISTS idx_hasapplied ON jobs(hasapplied);
'''

def create_database():
    # Check if database file already exists
    db_exists = os.path.exists(db_file)

    # Connect to the database (this will create it if it doesn't exist)
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()

    # Create the jobs table
    cursor.executescript(create_jobs_table_sql)

    # Commit the changes and close the connection
    conn.commit()
    conn.close()

    if db_exists:
        print(f"Connected to existing database: {db_file}")
    else:
        print(f"Created new database: {db_file}")
    print("Jobs table initialized successfully.")

if __name__ == "__main__":
    create_database()
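If you want to confirm the schema, a quick optional check against SQLite's built-in sqlite_master table looks like this:

import sqlite3

# Optional sanity check: list everything defined in the database
conn = sqlite3.connect('jobs.db')
for row in conn.execute("SELECT type, name FROM sqlite_master"):
    print(row)
conn.close()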

Now that our database is set up, let's modify our scraping code to collect everything we want to store. Here is the part of our existing code that extracts each job's details from the board:

for div in job_divs:
    job_title = div.find('div', class_='matchValue title').text.strip()
    job_link = div.find('a')['href']
    job_location = div.find('div', class_='location-container').text.strip()
    job_date = div.find('div', class_='matchValue date').text.strip()
    base_url = "https://novatec-software.softgarden.io/"
    full_link = base_url + job_link
    jobs.append({'title': job_title, 'location': job_location, 'date': job_date, 'link': full_link})

At this point we don't have access to the job description, because that lives on each job post's own page. We can modify this loop to use the Selenium driver to navigate to the job's page and collect the source for us. We won't try to extract the description by targeting specific elements on the page, because that isn't very future proof: suppose we look for elements with a job_description class; other job posts may follow a different template where the class name is different. Simply extracting all the text from the page is sufficient for our needs. We will, however, need to strip the markup and tidy the result, so our code includes a clean_text helper that collapses runs of whitespace (spaces, tabs, and newlines) into single spaces, as the short sketch below shows.
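Here is the clean_text helper on its own, applied to a made-up snippet of markup to show the effect (the same function appears in the full script below):

from bs4 import BeautifulSoup

def clean_text(element):
    # Get the visible text, then collapse runs of whitespace
    # (spaces, tabs, newlines) into single spaces
    text = element.get_text(separator=' ', strip=True)
    return ' '.join(text.split())

# Made-up markup, just to demonstrate the cleanup
messy = BeautifulSoup('<p>Senior   Engineer</p>\n<p>Remote,\n  Spain</p>', 'html.parser')
print(clean_text(messy))  # Senior Engineer Remote, Spain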

Now our code will look like this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
import logging

# Suppress selenium logging
logging.getLogger('selenium').setLevel(logging.ERROR)
logging.getLogger('urllib3').setLevel(logging.ERROR)

def setup_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--log-level=3')  # Only show fatal errors
    
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    return driver

# Initialize the Chrome driver
driver = setup_driver()

# Read the HTML content from the file
with open('sample_board.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all divs with id starting with 'job_id_'
job_divs = soup.find_all('div', id=lambda x: x and x.startswith('job_id_'))

# Base URL used to build absolute links to each job post
base_url = "https://novatec-software.softgarden.io/"

def clean_text(element):
    # Get the visible text, then collapse runs of whitespace
    # (spaces, tabs, newlines) into single spaces
    text = element.get_text(separator=' ', strip=True)
    return ' '.join(text.split())

# Extract job details
jobs = []
for div in job_divs:
    job_title = div.find('div', class_='matchValue title').text.strip()
    job_link = div.find('a')['href']
    job_location = div.find('div', class_='location-container').text.strip()
    job_date = div.find('div', class_='matchValue date').text.strip()
    full_link = base_url + job_link

    # Fetch the job post's page with Selenium and grab the rendered source
    driver.get(full_link)
    time.sleep(6)  # Wait for the page to load
    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Take all of the text on the page instead of targeting
    # template-specific elements
    job_description = detail_soup.find('body')
    description_text = clean_text(job_description) if job_description else 'No description available'

    jobs.append({
        'title': job_title,
        'location': job_location,
        'date': job_date,
        'link': full_link,
        'description': description_text
    })

# Clean up
driver.quit()

# Print the extracted job details
for job in jobs:
    print(f"Title: {job['title']}")
    print(f"Location: {job['location']}")
    print(f"Date: {job['date']}")
    print(f"Link: {job['link']}")
    print(f"Description: {job['description']}...")
    print("\n---\n")

A note on Selenium vs. requests: for a static page we could grab the source with the requests library instead, but sooner or later we'll run into content that is generated by JavaScript, which requests cannot execute. Selenium drives a real browser, so it sees the fully rendered page. If you are sure the content is not dynamically generated, requests will work with minor adjustments.
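For example, a minimal requests-based sketch, assuming the markup we need is present in the raw response:

import requests
from bs4 import BeautifulSoup

# Only viable when the job page is fully server-rendered
response = requests.get(full_link, timeout=10)
response.raise_for_status()
detail_soup = BeautifulSoup(response.text, 'html.parser')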

Run this code and we should see output that looks like this:

DevTools listening on ws://127.0.0.1:31836/devtools/browser/3879cd52-c7a6-46da-9ac5-b94046543b02
Title: Senior Full-stack Software Engineer (Angular)
Location: Spain
Date: 10/24/24
Link: https://novatec-software.softgarden.io/../job/49471363/Senior-Full-stack-Software-Engineer-Angular-?jobDbPVId=164121423&l=en
Description: View job here Senior Full-stack Software Engineer (Angular) Full Time Remote Spain With Professional Experience 10/24/24 🎮 Join the Quest at Novatec Software Engineering España! 🎮 📍 Location: Andalucia, Spain (Remote applicants from anywhere in Spain are welcome!) 🏆 Why Novatec? Imagine working in a place where your skills are not just valued but celebrated. At Novatec Software Engineering España , a proud branch of Novatec Consulting GmbH from Stuttgart, Germany, we turn passion for IT, agile software development, and open-source technology into impactful projects with top-tier clients. Our team is a vibrant mix of talents from over 10 countries, thriving on a management 3.0 mindset, agile methodologies, and minimal red tape. Every voice counts, and you have the freedom to shape your career and our success. 🎯 Your Mission: Elevate Fullstack Projects: Join a cutting-edge project and drive it forward with full-stack expertise. Master Front and Back-End: Use Angular for front-end development and Kotlin/Spring Boot for robust back-end services. Collaborate Globally: Work with a diverse, international team to deliver innovative solutions. Continuous Growth: Improve your skills in a dynamic, fun, and professional environment....

Now we have enough data to move on to the next step, which will be inserting the data into our database. We’ll cover that in the next part. Thanks!
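As a small preview of that next step, the insert will look roughly like this. Note that companyname is not in our scraped data yet, so the hard-coded 'Novatec' below is just a placeholder:

import sqlite3

# Rough sketch of the insert we'll build out in the next part,
# matching the schema created earlier
conn = sqlite3.connect('jobs.db')
conn.executemany(
    'INSERT INTO jobs (title, companyname, location, date, link, description) '
    'VALUES (?, ?, ?, ?, ?, ?)',
    [(job['title'], 'Novatec', job['location'], job['date'],
      job['link'], job['description']) for job in jobs]  # 'Novatec' is a placeholder
)
conn.commit()
conn.close()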

If you have any questions or need help with the code, feel free to reach out to me at blakelinkd@gmail.com.