Web Scraping Ironman Triathlon Results

Introduction

Ironman race results data was scraped from a third-party website for the purpose of exploratory data analysis (EDA). All the data and code used for extracting the results can be found at the following links:

GitHub Repository here
Kaggle Dataset here

The following files in the GitHub repository (mainly the Jupyter notebook and Python scripts) were used to scrape 140.6 Ironman race results ranging from 2002 to 2024 (as of 12-05-2024). Note that the data was not scraped from the official Ironman website, but from a proxy website not owned by Ironman.

The notebook and scripts are designed to generate 3 CSVs that follow the format of a standard relational database: they can be joined together using various IDs and readily uploaded to a SQL database.

How to run

Web scraping begins in IronMan Scraping (Scrape Only).ipynb. All cells should be run in order from top to bottom, which generates the series and race data.

The case for parallel processing

Individual race results data is dynamically loaded onto the webpage, meaning that BeautifulSoup4, which was used to scrape other information from the website, cannot be used here. Instead, Selenium allows for automated browser control (and in this scenario, actual data loading). The main caveat is that dynamically loading result data and launching a physical browser is not just slow but painfully slow: testing showed that, on average, Selenium processed ~12 rows of data per second, and with over 1,000,000 rows of data, the process would have taken around 25 hours to complete. Furthermore, scraping results data from the notebook meant that if the script failed at any point, the cell would have to be restarted, which could be problematic given these long processing times.
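The runtime estimate above works out as a quick back-of-the-envelope calculation (using the approximate row count and throughput quoted above):

```python
rows = 1_000_000          # approximate number of result rows to scrape
rows_per_second = 12      # observed single-browser Selenium throughput

seconds = rows / rows_per_second
hours = seconds / 3600
print(f"Estimated single-browser runtime: {hours:.1f} hours")  # ~23 hours
```

With "over" 1,000,000 rows, this lands in the ballpark of the 25-hour figure cited above.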

Instead, a Python script built on the subprocess package launches multiple browsers and scrapes races simultaneously. There are 2 .py files: master.py and worker.py.

  • Master.py functions as the coordinator of the web scraping and launches the instances of worker.py. Depending on the number of subprocesses requested, the script partitions the work equally (based on the number of races, not the number of rows to scrape) and kicks off instances of worker.py. If races have already been scraped and a CSV file has already been generated, master.py will skip these races and remove them from consideration for scraping.

  • Worker.py is the script that actually handles browser launching, web scraping, and file generation. It is passed index information from master.py describing which races it is responsible for and iterates through that list. There are instances in which Selenium may unexpectedly fail to scrape a page, causing the worker instance to exit immediately; in these scenarios, master.py was simply rerun.
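A minimal sketch of how master.py's race partitioning and worker launch could look; the `--num_workers` flag matches the usage shown below, but the worker-side argument names (`--start`, `--end`) are assumptions for illustration, not the repository's actual interface:

```python
import math
import subprocess
import sys

def partition(race_indices, num_workers):
    """Split the races into roughly equal contiguous chunks, one per
    worker (partitioned by race count, not by row count)."""
    chunk = math.ceil(len(race_indices) / num_workers)
    return [race_indices[i:i + chunk]
            for i in range(0, len(race_indices), chunk)]

def launch_workers(race_indices, num_workers):
    """Launch one worker.py subprocess per chunk and wait for all of them."""
    procs = [
        subprocess.Popen([sys.executable, "worker.py",
                          "--start", str(chunk[0]), "--end", str(chunk[-1])])
        for chunk in partition(race_indices, num_workers)
    ]
    for p in procs:
        p.wait()

if __name__ == "__main__":
    # e.g. 10 races split across 3 workers
    print(partition(list(range(10)), 3))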

Below is an example of how the script can be used, along with how to specify the number of worker instances you want (in this case, 8):

```shell
python master.py --num_workers 8
```

Note that increasing the number of workers does not necessarily increase web scraping performance linearly. There are diminishing returns to launching more browsers simultaneously: at around 10 workers, individual worker performance dropped to ~5 rows/second. Worker scripts generate CSV files and place them in the following file path: "./IronManData/raceResultsData/". The Python notebook contains code to combine all these CSVs and place the result in the same file path as the races and series data.

```python
# After each individual CSV has been created, they need to be
# combined into a single "master" CSV
import os

import pandas as pd

# Directory containing the CSV files
directory = './IronManData/raceResultsData'

# Initialize an empty list to store individual DataFrames
data_frames = []

# Iterate through all CSV files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory, filename)
        # Read the CSV file
        df = pd.read_csv(file_path)
        # Append the DataFrame to the list
        data_frames.append(df)

# Concatenate all DataFrames in the list
combined_df = pd.concat(data_frames, ignore_index=True)

# Write the combined DataFrame to a new CSV file
combined_df.to_csv('./IronManData/sql/results.csv', index=False)

print("All CSV files combined successfully!")
```

Resulting Dataset

3 CSVs should be generated at this point, containing all the relevant information about Ironman results from 2002 to 2024.

Series.csv

| Column | Description |
| --- | --- |
| id | Unique identifier for the series (Primary Key) |
| location | Location of the series |
| continent | Continent where the series is held |
| link | URL link to the series details |

Races.csv

| Column | Description |
| --- | --- |
| year | Year of the race |
| link | URL link to the race details |
| totalkonaSlots | Total Kona slots available |
| maleKonaSlots | Kona slots available for males |
| femaleKonaSlots | Kona slots available for females |
| male1st | Time of the first male finisher |
| female1st | Time of the first female finisher |
| finishers | Total number of finishers |
| dnf | Number of Did Not Finish (DNF) |
| dq | Number of Disqualifications (DQ) |
| id | Unique identifier for the race (Primary Key) |
| seriesID | Identifier for the series (Foreign Key) |

Results.csv

| Column | Description |
| --- | --- |
| bib | Bib number of the participant |
| name | Name of the participant |
| athleteLink | URL link to the athlete's profile |
| country | Country of the participant |
| gender | Gender of the participant |
| division | Division category of the participant |
| divLink | URL link to the division details |
| divisionRank | Rank of the participant in their division |
| overallTime | Total time taken by the participant |
| overallRank | Overall rank of the participant |
| swimTime | Swim time of the participant |
| swimRank | Swim rank of the participant |
| bikeTime | Bike time of the participant |
| bikeRank | Bike rank of the participant |
| runTime | Run time of the participant |
| runRank | Run rank of the participant |
| finishStatus | Finish status of the participant |
| dnf | Did Not Finish status |
| raceID | Identifier for the race (Foreign Key) |
| athleteID | Identifier for the athlete |
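Because the CSVs follow a relational layout, they can be joined on their IDs (results.raceID → races.id, races.seriesID → series.id). A minimal sketch in pandas, with toy rows standing in for the real files:

```python
import pandas as pd

# Toy rows standing in for Series.csv, Races.csv, and Results.csv
series = pd.DataFrame({"id": [1], "location": ["Kona"],
                       "continent": ["North America"]})
races = pd.DataFrame({"id": [10], "seriesID": [1], "year": [2024]})
results = pd.DataFrame({"raceID": [10], "name": ["A. Athlete"],
                        "overallTime": ["08:30:00"]})

# results -> races on raceID, then races -> series on seriesID
joined = (results
          .merge(races, left_on="raceID", right_on="id")
          .merge(series, left_on="seriesID", right_on="id",
                 suffixes=("_race", "_series")))

print(joined[["name", "year", "location", "overallTime"]])
```

The same joins translate directly to SQL once the CSVs are loaded into a database.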