In this project, our aim was to observe how the careers of workers in major gig-economy marketplaces evolve over time. By comparing the histories of established workers to those of new entrants to the marketplace, we hoped to observe and document structural barriers to participation. Additionally, we planned to look at the career trajectories of workers stratified by gender and race, to understand whether discrimination caused workers from specific groups to succeed or fail (and possibly even drop out of the marketplace entirely) at different rates.
As a service to the community, we make the code and datasets from this study available to the public. This includes raw, crawled snapshots of data from Freelancer and People Per Hour, collected between July 2018 and October 2019, and parsed dataframes extracted from these snapshots.
We developed two straightforward crawlers in Python that crawled Freelancer and People Per Hour every few weeks. These crawlers relied on the publicly available sitemap.xml files to identify worker profiles and take snapshots of them, including the raw HTML of each worker's page and their profile image. Each crawl took multiple days to complete, given the size of the worker population on each platform and the rate limits imposed by the platforms. Even with multi-day crawls, it was impossible to crawl all workers' profiles (2 million on People Per Hour and 25+ million on Freelancer), so our crawler sampled between 20,000 and 40,000 workers per crawl, using a strategy that tried to balance collecting data from long-term, well-established workers and new entrants to the marketplace. From crawl to crawl, the crawler tried to revisit the same workers to build up longitudinal data about their activities. The crawlers (<service>_xml.py) are both included in all of our archives.
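To illustrate the general approach (though not the crawlers' exact logic), the sketch below shows one way to collect worker profile URLs from a public sitemap.xml in Python. The sitemap URL and the flat sitemap structure are assumptions made for illustration; the actual crawlers (<service>_xml.py) additionally implement the sampling and revisit strategy described above.

    # Minimal sketch: collect candidate profile URLs from a sitemap.xml.
    # The sitemap URL and flat structure are illustrative assumptions;
    # real sitemaps may be split into multiple indexed files.
    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def profile_urls_from_sitemap(sitemap_url):
        """Return the <loc> entries listed in a single sitemap file."""
        resp = requests.get(sitemap_url, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

    if __name__ == "__main__":
        # Hypothetical sitemap URL; substitute the platform's real one.
        urls = profile_urls_from_sitemap("https://www.example.com/sitemap.xml")
        print(f"Found {len(urls)} candidate profile URLs")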
The raw data is broken down by service (Freelancer or People Per Hour) and by the start date of the crawl. Each directory contains a file (profile_urls_all.txt) that lists the URLs of all worker profiles that the crawler identified as existing during the crawl; a second file (profile_urls_crawl.txt) lists the URLs that were actually crawled, which is a subset of the former. Each directory also contains a file (profiles.json.bz2) that holds the raw HTML from each worker's profile, wrapped in a JSON object containing simple meta-data like the URL and the timestamp of the snapshot; each JSON object is on one line of this file. Lastly, each directory contains a file (profile_imgs.tar) that contains all of the workers' profile images that the crawler was able to identify and download.
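Because each line of profiles.json.bz2 holds one JSON object, the file can be streamed record by record without fully decompressing it on disk. The sketch below shows one way to iterate over it; the key names "url" and "timestamp" are assumptions for illustration, so consult the parsing scripts for the exact fields.

    # Minimal sketch: stream snapshots from a profiles.json.bz2 file,
    # one JSON object per line. Key names are illustrative assumptions.
    import bz2
    import json

    def iter_snapshots(path):
        """Yield one decoded JSON object per non-empty line."""
        with bz2.open(path, mode="rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)

    if __name__ == "__main__":
        for snapshot in iter_snapshots("profiles.json.bz2"):
            print(snapshot.get("url"), snapshot.get("timestamp"))
            break  # just inspect the first record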
Scripts for parsing the data into Parquet-format dataframes (<service>_parse_to_parquet.py) are both included in all of our archives. These scripts are useful for understanding what data can be extracted from workers' profiles, the selectors needed to grab this data, and the column names and data types in the resulting Parquet files. Note that these scripts depend on Apache Spark to execute: the volume of data is such that parsing must be parallelized to complete in a reasonable amount of time.
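As a rough illustration of the overall pattern (read the JSON-lines snapshots, parse each profile in parallel, write Parquet), the following PySpark skeleton is a minimal sketch. The parse_profile function and its output columns are placeholders, not the actual selectors or schema used by <service>_parse_to_parquet.py.

    # Minimal PySpark sketch of the parse-to-Parquet pattern. The fields
    # extracted here are placeholders; the real scripts pull many more
    # columns out of the profile HTML with service-specific selectors.
    from pyspark.sql import Row, SparkSession

    def parse_profile(record):
        # Placeholder parse step; assumes "url" and "timestamp" keys exist.
        return Row(url=record["url"], timestamp=record["timestamp"])

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("parse_profiles").getOrCreate()
        raw = spark.read.json("profiles.json.bz2")  # Spark decompresses .bz2 transparently
        parsed = spark.createDataFrame(raw.rdd.map(lambda r: parse_profile(r.asDict())))
        parsed.write.mode("overwrite").parquet("service_raw.parquet")
        spark.stop()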
For each service we provide two dataframes: <service>_raw.parquet and <service>_combined.parquet. The former contains one row for each (worker, timestamp) tuple in the raw data, i.e., each row corresponds to one snapshot in time of a worker collected by the crawler; there are therefore potentially multiple rows per worker, as the crawler may have visited each worker multiple times. The latter contains one row per worker, i.e., it combines the individual snapshots for each worker into one row by aggregating information over time. The parsing scripts (<service>_parse_to_parquet.py) document the column names and data types in the dataframes.
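For exploratory analysis, the parsed dataframes can be loaded directly with pandas (the Parquet-only archives are small enough for this). The column names "worker_id" and "timestamp" in the sketch below are assumptions for illustration; the parsing scripts are the authoritative source for the actual schema.

    # Minimal sketch: load the parsed dataframes with pandas
    # (requires pyarrow or fastparquet). "worker_id" and "timestamp"
    # are assumed column names.
    import pandas as pd

    # One row per (worker, timestamp) snapshot.
    raw = pd.read_parquet("freelancer_raw.parquet")

    # Count how many snapshots were collected per worker.
    snapshots_per_worker = (
        raw.groupby("worker_id")["timestamp"]
           .count()
           .sort_values(ascending=False)
    )
    print(snapshots_per_worker.head())

    # One row per worker, aggregated over time.
    combined = pd.read_parquet("freelancer_combined.parquet")
    print(combined.shape)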
We make four compressed archives available. Two archives contain the complete datasets, crawling scripts, and parsing scripts for Freelancer and People Per Hour, respectively. Warning: these archives are tens of gigabytes in size. The remaining two archives contain only the parsed dataframes in Parquet format and the associated Python scripts; these archives are megabytes in size.