

Athletic.Net - XC Statistics

Data Collection Modules

This project is a work-in-progress, and I'll probably add more as I have time, but so far here's what we've got:

  • Web-scraping with Beautiful Soup 4 and Scrapy

  • Piping scraped web data to a MySQL database

  • Pulling filtered data from our database for external analysis

BS4 Athletic.Net Scrape - 300 ID Sample

I wanted to do a quick demo with BS4, as it's a user-friendly and very popular web-scraping package. The following is a demonstration of extracting athletes' seasonal cross country records from 300 athletic.net URLs, followed by several database operations with SQLAlchemy.
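In skeleton form, the approach looks something like this; the URL pattern and CSS selectors below are placeholder assumptions for illustration, not athletic.net's actual markup (which would need inspecting, and is partly rendered dynamically):

```python
# A minimal sketch of the BS4 scrape, not the exact demo code.
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL pattern for an athlete's profile page.
BASE_URL = "https://www.athletic.net/CrossCountry/Athlete.aspx?AID={athlete_id}"


def scrape_athlete(athlete_id: int) -> list[dict]:
    """Pull one athlete's seasonal XC records from their profile page."""
    resp = requests.get(BASE_URL.format(athlete_id=athlete_id), timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    records = []
    # Hypothetical selector -- assumes each season's results sit in table rows.
    for row in soup.select("table.season-records tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 3:
            records.append({"athlete_id": athlete_id, "event": cells[0],
                            "time": cells[1], "date": cells[2]})
    return records


if __name__ == "__main__":
    all_records = []
    for aid in range(1, 301):  # the 300-ID sample
        all_records.extend(scrape_athlete(aid))

    with open("xc_records.csv", "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["athlete_id", "event", "time", "date"])
        writer.writeheader()
        writer.writerows(all_records)
```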

XC Times to SQLite Database

Taking our scrape from above (saved as a CSV) and pushing it to a local database table with SQLAlchemy.
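In outline, that step is short; file and table names here are placeholders carried over from the sketch above:

```python
# A minimal sketch of the CSV -> SQLite step.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///xc_stats.db")

df = pd.read_csv("xc_records.csv")
# Write (or replace) the table; SQLAlchemy handles the DDL for us.
df.to_sql("records", engine, if_exists="replace", index=False)

# Sanity check: read a few rows back out.
print(pd.read_sql("SELECT * FROM records LIMIT 5", engine))
```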

Scrapy->MySQL Pipeline

Alright, let's get serious.

I included the above modules primarily for the sake of example.

Below is the spider module for a more efficient web scrape, which feeds a Scrapy pipeline into a MySQL database.

Not all of the Python code is shown here, as this project required a number of separate files to set up; I'll just include the spider I used to crawl athletic.net for season records, followed by some of the results in MySQL and then in Excel tables.
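That said, a minimal sketch of what the MySQL side of such a pipeline might look like (assuming the pymysql driver; credentials, table, and column names are placeholders):

```python
# pipelines.py -- a hedged sketch of a Scrapy item pipeline writing to MySQL.
import pymysql


class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host="localhost", user="user", password="password",
            database="xc_stats", charset="utf8mb4")
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO records (athlete_name, grade, event, time, date) "
            "VALUES (%s, %s, %s, %s, %s)",
            (item["athlete_name"], item["grade"], item["event"],
             item["time"], item["date"]))
        # Committing once per item keeps the sketch simple; batching commits
        # would be faster at the scale of a million pages.
        self.conn.commit()
        return item
```

A pipeline like this gets registered under ITEM_PIPELINES in the project's settings.py.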

XC Season Records Spider (Scrapy)
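A sketch of the shape such a spider takes; the URL pattern, ID range, and selectors below are assumptions for illustration, not the live site's actual markup:

```python
# spiders/xc_records.py -- a hedged sketch, not the exact spider.
import scrapy


class XCSeasonRecordsSpider(scrapy.Spider):
    name = "xc_season_records"

    def start_requests(self):
        # Crawl a range of athlete profile IDs (placeholder URL and bounds).
        for athlete_id in range(1, 1_000_001):
            url = (f"https://www.athletic.net/CrossCountry/"
                   f"Athlete.aspx?AID={athlete_id}")
            yield scrapy.Request(url, callback=self.parse,
                                 cb_kwargs={"athlete_id": athlete_id})

    def parse(self, response, athlete_id):
        # Hypothetical selectors; the real page would need inspecting.
        name = response.css("h1::text").get(default="").strip()
        for row in response.css("table.season-records tr"):
            cells = row.css("td::text").getall()
            if len(cells) >= 4:
                yield {
                    "athlete_id": athlete_id,
                    "athlete_name": name,
                    "grade": cells[0],
                    "event": cells[1],
                    "time": cells[2],
                    "date": cells[3],
                }
```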

Viewing our Aggregated Data

And the last installment in this first segment of my XC statistics project is a quick look at our results, first in the database itself (via the command line), and then briefly in an Excel table. This serves as a reminder that we can now access and quickly work with these data across a variety of analysis tools, whereas before they were dispersed as HTML across millions of separate webpages. This is the power of data collection and aggregation; this is the utility of databases.

Above, I've simply selected all entries in our database (at the moment I've scraped 1,000,000 URLs for athlete stats, with more to come) where the athlete name is 'Alex Johnson', a very common name.
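For reference, the equivalent query run through SQLAlchemy looks something like this (the connection string is a placeholder):

```python
from sqlalchemy import create_engine, text

# Placeholder credentials and database name.
engine = create_engine("mysql+pymysql://user:password@localhost/xc_stats")

with engine.connect() as conn:
    # Same SELECT I'd run at the MySQL command line.
    result = conn.execute(
        text("SELECT * FROM records WHERE athlete_name = :name"),
        {"name": "Alex Johnson"},
    )
    for row in result:
        print(row)
```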

While it is possible to tell each individual apart by the entry's primary key ('id') and by looking at the years/grades for each of their records, it would be convenient to have more information to tell each runner apart.

That is why I've scraped athlete info into a separate table in the database, related (hence the term 'relational database') to the 'records' table above by matching its 'name' column to 'athlete_name'. This table includes each athlete's name, sex, and the school they run for.
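A join along that relation might look like the following; the 'athletes' table and its columns are assumed names based on the description above:

```python
from sqlalchemy import create_engine, text

# Placeholder connection string and schema.
engine = create_engine("mysql+pymysql://user:password@localhost/xc_stats")

# Attach sex and school to each record so same-named runners can be told apart.
query = text("""
    SELECT r.id, r.athlete_name, a.sex, a.school, r.event, r.time
    FROM records AS r
    JOIN athletes AS a ON a.name = r.athlete_name
    WHERE r.athlete_name = :name
""")

with engine.connect() as conn:
    for row in conn.execute(query, {"name": "Alex Johnson"}):
        print(row)
```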

Below I've taken 5,000m records from the first 10,000 entries in our MySQL records table (via pandas' read_sql function, through a SQLAlchemy-constructed engine) and am viewing them in an Excel spreadsheet.
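Roughly, that step looks like this (the connection string and the exact 'event' label are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string.
engine = create_engine("mysql+pymysql://user:password@localhost/xc_stats")

# 5,000m results from the first 10,000 rows of the records table.
query = """
    SELECT *
    FROM (SELECT * FROM records ORDER BY id LIMIT 10000) AS first_rows
    WHERE event = '5,000 Meters'
"""
df = pd.read_sql(query, engine)

# Hand off to Excel for a quick look (requires openpyxl).
df.to_excel("xc_5k_records.xlsx", index=False)
```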

It's just a simple table, with a column added to display each record's percentile rank relative to the whole set of 5k records. This metric by itself doesn't say too much; for instance, it might be more interesting to group by grade, or even by grade and year simultaneously, in order to view athletes' percentile ranks among their classmates. However, I'd like to finish gathering all the data before doing serious analysis and trying to draw conclusions. I still have tens of millions of pages to crawl before we get there ;).
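With pandas, both the overall percentile and that grade/year grouping are one-liners. A toy sketch with made-up data, assuming times have already been parsed into seconds:

```python
import pandas as pd

# Toy data standing in for the real 5k records (times in seconds).
df = pd.DataFrame({
    "grade": [9, 9, 10, 10, 9],
    "year": [2019, 2019, 2019, 2019, 2020],
    "time_seconds": [1050.0, 980.5, 1102.3, 995.0, 1010.0],
})

# Percentile rank across the whole set: fastest time -> highest percentile.
df["percentile"] = df["time_seconds"].rank(pct=True, ascending=False) * 100

# The grouping mentioned above: rank within grade and year simultaneously.
df["pct_in_class"] = (
    df.groupby(["grade", "year"])["time_seconds"]
      .rank(pct=True, ascending=False) * 100
)
print(df)
```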

Nevertheless, I believe this introduction to collecting and organizing scattered statistical data from the web gives a decent taste of the opportunities available, given the right skills and tools.

The following are just a few quick samples to provide some evidence of my comprehension of SQL queries, and to demo some simple SQLAlchemy.

SQLAlchemy Demo - XC Stats Subset
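A sketch in the spirit of that demo, run against the SQLite table from earlier; column names like 'time_seconds' are placeholders, not the table's actual schema:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///xc_stats.db")

with engine.connect() as conn:
    # Each athlete's fastest 5k, ten quickest overall.
    top_ten = conn.execute(text("""
        SELECT athlete_name, MIN(time_seconds) AS best_5k
        FROM records
        WHERE event = '5,000 Meters'
        GROUP BY athlete_name
        ORDER BY best_5k
        LIMIT 10
    """))
    for name, best in top_ten:
        print(name, best)

    # A simple aggregate: how many records we hold per grade.
    for grade, n in conn.execute(
            text("SELECT grade, COUNT(*) FROM records GROUP BY grade")):
        print(grade, n)
```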

