I intended to scrape only monster.com, which is one of the best job search websites, but changed my mind when I found out that most posted jobs do not include salary information. That is no fun at all, so I chose glassdoor.com as my second target website.
Overview
In this project, I focused mainly on three jobs (Data Analyst, Web Developer and Mobile App Developer) on the Monster website (see the full code at https://github.com/A3M4/Web-Scraping-Job-Info-on-Monster.com-by-Scrapy). The diagram below shows the workflow: collecting structured raw data from web pages, selecting some of that data, then parsing and visualizing it.
Data Flow in Scrapy
The diagram below, credited to https://docs.scrapy.org, shows the steps of data transfer. First, the user runs a command (scrapy crawl monster-spider) in a terminal inside the spider's directory. The ENGINE obtains the initial Requests from the SPIDER and schedules them in the SCHEDULER. The SCHEDULER hands the next Requests back to the ENGINE, which sends them to the DOWNLOADER through the downloader MIDDLEWARE (hooks that add functionality to the download step, e.g. proxies, user-agent headers, etc.). Once a page finishes downloading, the Response goes back to the ENGINE and then to the SPIDER via the spider MIDDLEWARE (which can, for example, filter out responses with bad HTTP status codes). The SPIDER processes the Response and returns scraped items and new Requests to the ENGINE. The ENGINE sends the items to the ITEM PIPELINES, where the scraped items in this project are saved into a database, passes the new Requests to the SCHEDULER, and asks it for the next Request to crawl. The cycle repeats until the SCHEDULER has no Requests left.
Assigning Attributes to the Spider
After setting up the environment and creating a Scrapy project, the code in the monster_spider.py file is as follows:
class MonsterSpider(scrapy.Spider):
Usually the start_urls class attribute holds the page where the spider begins to crawl (e.g. https://www.monster.com/jobs/search/?q=Data-Analyst), and the default implementation of start_requests() uses it to generate the initial requests. However, while browsing the site with Chrome DevTools, I found a URL that looks like a server-side web API.
It returns information in well-structured JSON. Each page contains 25-30 job postings, one of which is shown in the diagram below (formatted with JSON Beautifier). This is simpler to parse than locating web elements with XPath or CSS selectors.
Each job is identified by a MusangKingId and a postingId. Either ID works, but not all jobs have a MusangKingId. Therefore a list of these API URLs is assigned to start_urls in order to collect each postingId; the job title and the number of pages are defined by user input.
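To make the idea of building start_urls from user input concrete, here is a minimal sketch. The endpoint and the query parameter names (q, page) are placeholders I made up for illustration; the real API URL found in DevTools is not reproduced here.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real Monster API URL.
API_BASE = "https://example.com/jobs/search/api"


def build_start_urls(job_title, pages):
    """Return one search-API URL per results page for the given job title."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return [
        # urlencode escapes the spaces in the job title for us.
        f"{API_BASE}?{urlencode({'q': job_title, 'page': page})}"
        for page in range(1, pages + 1)
    ]


if __name__ == "__main__":
    for url in build_start_urls("Data Analyst", 3):
        print(url)
```

In a spider, the returned list could be assigned to start_urls before the crawl begins.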
Parse()
def parse(self, response):
The postingId sits in the middle of the ImpressionTracking text and can easily be grabbed with a regular expression. Similarly, next_url, which returns the details of a single job in JSON format, was found with DevTools. Part of the data it contains is shown in the following picture; it is very neat except for the HTML tags and comments in the jobdescription field, which can be removed with BeautifulSoup and bs4.element.
The last line of parse() uses yield, which makes parse() a generator, and hands next_url on to parse_detail().
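As a rough illustration of the regex step, the sketch below pulls an ID out of a tracking string. The "postingId=<value>" layout and the sample string are assumptions on my part; the real ImpressionTracking text may look different, so the pattern would need adjusting.

```python
import re

# Assumed layout: the ID appears as "postingId=<value>" inside the tracking text.
POSTING_ID_RE = re.compile(r"postingId=([0-9A-Za-z-]+)", re.IGNORECASE)


def extract_posting_id(impression_tracking):
    """Return the posting ID embedded in the tracking text, or None if absent."""
    match = POSTING_ID_RE.search(impression_tracking)
    return match.group(1) if match else None


# Made-up sample string, for illustration only.
sample = "src=search&postingId=ab12cd34-0000-1111-2222-333344445555&pos=7"
print(extract_posting_id(sample))
```

Returning None instead of raising keeps the spider running when a posting has no recognizable ID.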
Parse_detail()
def parse_detail(self, response):
Since all the data is in JSON format, the code for extracting the needed fields is clean and simple. The spider then returns the extracted data (as a Python dictionary) to its container, the MonsterItem class in the items.py file.
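A minimal sketch of this kind of extraction, using only the standard json module. The key names (title, companyInfo, location) and the sample payload are hypothetical; the real response uses whatever keys the site's API returns.

```python
import json


def extract_job_fields(payload):
    """Pick a few fields out of a job-detail JSON document (keys are assumed)."""
    data = json.loads(payload)
    return {
        "title": data.get("title"),
        # "or {}" guards against a missing or null companyInfo object.
        "company": (data.get("companyInfo") or {}).get("name"),
        "location": data.get("location"),
    }


# Made-up payload, for illustration only.
sample_payload = json.dumps(
    {"title": "Data Analyst", "companyInfo": {"name": "Acme"}, "location": "Boston, MA"}
)
print(extract_job_fields(sample_payload))
```

Using .get() rather than square brackets means a posting with a missing field yields None for that field instead of crashing the callback.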
Storing data in SQLite3 Database
To export data to SQLite, we need to connect to the database and create a table in pipelines.py. A detailed introduction to pipelines can be found at docs.scrapy.org:
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
class MonsterPipeline(object):
The block of code above is an example that stores only the job title. create_connection() and create_table() are called in the __init__ method, so, as their names imply, they connect to the database and create the table whenever MonsterPipeline is instantiated.
To make sure the pipeline runs, the default block of code shown below in settings.py needs to be uncommented.
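For readers who want a complete, self-contained version of the pattern just described, here is a minimal sketch of a pipeline that stores only a job title. It is not the project's actual pipeline: the class name, the "title" item key, the jobs table and the jobs.db file name are all placeholders.

```python
import sqlite3


class JobTitlePipeline:
    """Minimal pipeline that stores one field per item in SQLite."""

    def __init__(self, db_path="jobs.db"):
        self.db_path = db_path
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect(self.db_path)
        self.cur = self.conn.cursor()

    def create_table(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS jobs (title TEXT)")

    def process_item(self, item, spider):
        # Parameterized query, so quotes in a title cannot break the SQL.
        self.cur.execute("INSERT INTO jobs (title) VALUES (?)", (item["title"],))
        self.conn.commit()
        return item  # hand the item on to any later pipeline component

    def close_spider(self, spider):
        self.conn.close()
```

Scrapy calls process_item() once per scraped item and close_spider() when the crawl ends, so the connection stays open for the whole run.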
ITEM_PIPELINES = {
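Once uncommented, the setting typically has the shape sketched here. The package name "monster" is a placeholder for the Scrapy project's own name, and 300 is simply the value Scrapy's project template suggests.

```python
# settings.py (fragment). "monster" is a placeholder project/package name.
ITEM_PIPELINES = {
    # Lower numbers run first; values conventionally fall in the 0-1000 range.
    "monster.pipelines.MonsterPipeline": 300,
}
```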
Opened in SQLite, the generated database file looks like the table below.
Anti-scraping Techniques
Monster.com does not require a CAPTCHA to access any of its data, so the spider uses only three anti-scraping countermeasures: varying the scraping speed, a random User-Agent provided by scrapy-fake-useragent (https://pypi.org/project/scrapy-fake-useragent/) and rotating IPs provided by scrapy-proxy-pool (https://github.com/hyan15/scrapy-proxy-pool). The links for the last two libraries contain detailed information.
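A settings.py fragment along these lines would switch on all three measures. The delay value is arbitrary, and the middleware paths and priorities are the ones I recall from each package's README, so please check them against the versions you install.

```python
# settings.py (fragment): an illustrative configuration, not the project's actual values.

# Vary scraping speed: wait about 2 s between requests; Scrapy randomizes the
# wait to between 0.5x and 1.5x of DOWNLOAD_DELAY when the flag below is on.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Switches proxy rotation on for scrapy-proxy-pool.
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware ...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ... and let scrapy-fake-useragent pick a random User-Agent instead.
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    # scrapy-proxy-pool: rotate proxies and detect bans.
    "scrapy_proxy_pool.middlewares.ProxyPoolMiddleware": 610,
    "scrapy_proxy_pool.middlewares.BanDetectionMiddleware": 620,
}
```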
Glassdoor Spider
Since the site structure of Glassdoor is somewhat different from Monster's, I built another spider based on Scrapy and urllib. See the full code at https://github.com/A3M4/Glassdoor.com-Job-Info-Web-Crawler-by-Scrapy. The spider stores its data in MySQL (browsed here with MySQL Workbench). Its table for data analyst is shown below, and the salary range listed by the employer is separated into two columns: salarylow and salaryhigh. In addition, size and year mean "number of employees" and "company founded year" respectively.
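Splitting a salary range into two columns can be sketched as follows. The "$45K-$72K" text format is an assumption for illustration; the actual strings on the site may differ, and the linked repository may handle them differently.

```python
import re

# Assumed text format such as "$45K-$72K"; anything else yields (None, None).
SALARY_RE = re.compile(r"\$(\d+)K\s*-\s*\$(\d+)K", re.IGNORECASE)


def split_salary(text):
    """Split a salary-range string into (low, high) in whole dollars."""
    match = SALARY_RE.search(text)
    if not match:
        return None, None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high
```

The two returned values map directly onto the low and high salary columns of the table.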
Data Visualization
The visualizations were created with Tableau and Matplotlib. Tableau is fantastic software for making polished presentations and doing basic analysis, but for heavy lifting, visualization libraries in R or Python are definitely my first choice. The following image provides basic information (average base salary, star rating by employees and company founded year) for the three positions.