Finding a job with Python and Selenium

It takes a lot of work to find a job. In this post I’ll show you how to automate much of the search using Python and Selenium. I’ll show you how to find job boards that are easy to scrape, and we’ll save that data and analyze it so we can find the jobs we are most qualified for and most likely to score an interview for. We’ll also explore some methods of filtering jobs by how compatible they are with our skill set using free AI tools such as Ollama or LM Studio, and we’ll use an LLM to perform keyword extraction to make searching through job posts easier. Finally, we’ll create a nice web application using React and Flask that lets us browse and search the jobs from all the job boards we have been scraping.

The Tools

Python

I like Python for doing this type of work because it’s simple and effective. If you don’t know how to install and set up Python, you can learn how at this link: Installing Python

Selenium

Python has built-in options for extracting content from web pages. However, many job boards use JavaScript to render content inside the browser, and to capture that content we need a real browser instance. This is where Selenium comes in: it lets us create a real browser session and control our interactions with the page programmatically.
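
Here’s a minimal sketch of what that looks like. The URL is just a placeholder, and the selenium and webdriver-manager packages get installed a little later in this post:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://example.com/jobs")  # placeholder URL

# page_source holds the fully rendered HTML, including JavaScript-generated content
html = driver.page_source
driver.quit()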

SQLite

We have a LOT of data to store, so Excel is not going to cut it. One database table I have holds about a week’s worth of job posts and is over 100MB. SQLite will also let us run the advanced queries against our data that we’ll need when it comes time to analyze and sort our jobs.
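
To give you an idea of what that will look like, here’s a rough sketch of a jobs table and a query. The real schema comes in part 2, so treat these column names as placeholders:

import sqlite3

conn = sqlite3.connect("jobs.db")

# A simple table for the fields we'll scrape from each post
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        location TEXT,
        date TEXT,
        link TEXT UNIQUE
    )
""")
conn.commit()

# The kind of query a spreadsheet can't do comfortably at this scale
for title, location in conn.execute(
        "SELECT title, location FROM jobs WHERE title LIKE '%Developer%' ORDER BY date DESC"):
    print(title, location)

conn.close()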

Run the following command, and you should see the version of Python you have installed:

python --version
Python 3.11.7

Now we can install the libraries we’ll need:

pip install requests beautifulsoup4 selenium webdriver-manager

Now I’ll create a new folder for my project. I name mine scraper, but you can name yours whatever you want.

Job boards

Companies use many different sites to post their jobs. Some are easier to scrape than others, and some have sophisticated automation detection (like indeed.com) that makes scraping very difficult. We need to find a board without that restriction, and we also need to keep in mind that many of these sites have rate limiters, so when we collect data we should do it at a slow pace and not overload the server (there’s a small pacing sketch at the end of this post). For the sake of this demo I’m going to pick the site softgarden.io. I search Google with a query like site:softgarden.io Senior Java Developer and I see this link:

Google Search Result

Cool, I am one of the genders, so this is great. Notice the subdomain serviceplan in the link https://serviceplan.softgarden.io/job/49387133/Senior-Full-Stack-Developer-all-genders/?l=en This is usually a URL-friendly version of the company’s name. We also see other domains such as https://inform-software.softgarden.io/. So, if you want to find out whether a company has jobs posted on softgarden.io, you could query Google with site:softgarden.io Novatec job and you might see something like this:

Novatec Link

When we visit the link we see that particular job post, but we want all of the posts by this particular company. To do that we just knock everything off the URL that comes after the domain, which gives us a link like: Novatec Software Jobs
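
If you want to do that trimming in code rather than by hand, urllib.parse makes it easy. The job URL below is just an illustration:

from urllib.parse import urlparse

# An individual job post URL (illustrative)
job_url = "https://novatec-software.softgarden.io/job/49472208/Full-stack-Software-Engineer?l=en"

# Keep only the scheme and host to get the company's board URL
parsed = urlparse(job_url)
board_url = f"{parsed.scheme}://{parsed.netloc}/"
print(board_url)  # https://novatec-software.softgarden.io/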

Cool, now we can see all of their job posts:

Novatec board

Let’s have a look at the page source and see what the HTML of these list items looks like. Right-click on one of the job links and select Inspect. This opens the dev tools on the underlying code.

<div class="matchElement odd" style="undefined" id="job_id_49993923"></div>

We see each job is wrapped in a div that has an id attribute like job_id_49993923. This gives us a good target for scraping. Copy all the code from the View Source tab and save it to a file called sample_board.html. We’ll work against this file instead of loading the page over and over while we develop our script.
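
If you’d rather not copy and paste from View Source by hand, Selenium can save the rendered page for you. A quick sketch, reusing the driver setup from earlier:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://novatec-software.softgarden.io/")

# Save the rendered HTML so we can develop against a local copy
with open("sample_board.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)

driver.quit()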

Now let’s write some Python code to scrape the job details:

from bs4 import BeautifulSoup

# Read the HTML content from the file
with open('sample_board.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all divs with id starting with 'job_id_'
job_divs = soup.find_all('div', id=lambda x: x and x.startswith('job_id_'))

# Extract job details
base_url = "https://novatec-software.softgarden.io/"
jobs = []
for div in job_divs:
    job_title = div.find('div', class_='matchValue title').text.strip()
    job_link = div.find('a')['href']
    job_location = div.find('div', class_='location-container').text.strip()
    job_date = div.find('div', class_='matchValue date').text.strip()
    full_link = base_url + job_link
    jobs.append({'title': job_title, 'location': job_location, 'date': job_date, 'link': full_link})

# Print the extracted job details
for job in jobs:
    print(job)

When you run the code you should see results like:

{'title': 'IT Talent Acquisition Specialist', 'location': 'Granada', 'date': '10/11/24', 'link': 'https://novatec-software.softgarden.io/../job/49993923/IT-Talent-Acquisition-Specialist-?jobDbPVId=160708533&l=en'}
{'title': 'Database Engineer', 'location': 'Spain', 'date': '10/11/24', 'link': 'https://novatec-software.softgarden.io/../job/49990373/Database-Engineer?jobDbPVId=160698573&l=en'}
{'title': 'Full-stack Software Engineer', 'location': 'Spain', 'date': '9/27/24', 'link': 'https://novatec-software.softgarden.io/../job/49472208/Full-stack-Software-Engineer?jobDbPVId=157409078&l=en'}

This is a great start. At this point you could already collect a lot of job posts and filter them by job title, but we’re going to go further. In part 2 we’ll set up the database and figure out how to extract the job descriptions from the posts.
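
One last note on the pacing I mentioned earlier: once you start looping over live boards instead of a saved file, put a delay between requests so you don’t overload the server. A simple sketch, with an arbitrary delay range you should tune to what the site tolerates:

import random
import time

# Placeholder list; in practice these are the board pages you want to visit
board_urls = [
    "https://novatec-software.softgarden.io/",
]

for url in board_urls:
    # ... fetch and parse the page here ...
    print(f"Fetched {url}")
    time.sleep(random.uniform(5, 15))  # pause a few seconds between requests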

If you have any questions email me at blakelinkd@gmail.com.