Data Mining & Web Scraping

I’m taking a data mining course this year. It’s fascinating new stuff and very applicable with all the buzz about “big data” and “BI.” Our professor requires that we do a research project, so I got together with two friends and we started brainstorming about what we wanted to mine. We talked about some of the things we’d like to do predictive analysis on, like stocks, bonds, or other securities. We thought about March Madness predictions. Then we settled on what we named “The Last Minute Vacation.” We haven’t decided exactly what we want to predict yet, but we are gathering data on airline flights and prices. One idea was to try to predict the best time to buy tickets last minute, since there is a certain window of opportunity when tickets are really cheap. If this window can be predicted consistently, then it could be cost effective to wait until the last minute to buy tickets. We realize that much of this information is widely known; however, we thought it would be fun to analyze the data ourselves and see what other trends we might find.

We have since researched some basic web scraping methods and settled on Selenium, a browser automation framework with Ruby bindings (the selenium-webdriver gem). Basically, you set up a web driver object in Ruby, point it at a browser like Firefox or Chrome, and then tell it to get a website. We settled on Kayak because it makes things really simple.
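Here’s a minimal sketch of that setup, assuming the selenium-webdriver gem is installed (gem install selenium-webdriver):

```ruby
# Minimal Selenium setup: create a driver, load a page, confirm it worked.
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox   # or :chrome
driver.get 'http://www.kayak.com'
puts driver.title                           # confirm the page loaded
driver.quit
```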

First off, Kayak links are easy to hack. Actually, you don’t even have to hack them; they just spell it out for you. For example, if you want to fly from Salt Lake City to San Francisco on March 13, 2012 and return March 16, 2012, you can type in the following link: http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16. If you don’t want a round-trip flight, just lose the last date. If you want a different airport, change the three-letter airport code. It’s that simple.
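As a rough sketch, a tiny helper can assemble those links; kayak_url is a name I made up for illustration, not anything Kayak provides:

```ruby
# Hypothetical helper that builds a Kayak search URL in the format above.
def kayak_url(from, to, depart, ret = nil)
  url = "http://www.kayak.com/#/flights/#{from}-#{to}/#{depart}"
  url += "/#{ret}" if ret   # one-way search: just lose the last date
  url
end

puts kayak_url('SLC', 'SFO', '2012-03-13', '2012-03-16')
# http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16
```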

To make things even easier, Kayak actually lists a link to download a CSV file of all the flights available for your search query. We spent about 10 minutes trying to figure out how best to scrape the necessary data from each query and then stumbled upon this jewel. As you can imagine, this significantly simplified our task. Thank you, Kayak.
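Something along these lines pulls the file down with the same driver; the ‘CSV’ link text here is a stand-in for whatever Kayak’s page actually uses, and open-uri is just one way to fetch it:

```ruby
# Sketch of grabbing the CSV download link and saving the file.
require 'open-uri'

link    = driver.find_element(:partial_link_text, 'CSV')
csv_url = link.attribute('href')
File.write('flights.csv', URI.open(csv_url, &:read))
```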

The tricky part has been making sure Kayak doesn’t lock us out. To tackle this we’ve added random wait times between requests. We’ve also run into errors parsing the site for the CSV link: sometimes the site loads too slowly, or the link just isn’t there for whatever reason. Most of this was easily handled with Ruby begin-rescue clauses. Now we are just trying to automate the scraping from a server so we don’t have to run it manually. This should be easy enough, except that the server we are going to use has no GUI, so that’s a bit of a problem for a browser-based scraper.
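The pattern looks roughly like this; the wait bounds and retry limit are arbitrary choices for illustration, not tuned values:

```ruby
# Begin-rescue with a random wait, retrying when the page loads
# slowly or the CSV link is missing for whatever reason.
def fetch_csv_url(driver, url, max_tries = 3)
  tries = 0
  begin
    driver.get url
    sleep(5 + rand(10))   # random wait so requests don't look clockwork
    driver.find_element(:partial_link_text, 'CSV').attribute('href')
  rescue Selenium::WebDriver::Error::NoSuchElementError, Timeout::Error
    tries += 1
    retry if tries < max_tries   # slow load or missing link: try again
    nil                          # give up on this query and move on
  end
end
```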

The script presently scrapes about 150 CSV files per run. We grab flight data for 30 consecutive days from the current date, and then every 7th day from there out to about six months. Once we finish the scraping and run some analysis, I’ll post some of our findings if anything worthwhile shows up.
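For the curious, that departure-date schedule works out to something like this:

```ruby
# 30 consecutive days from today, then every 7th day out to ~6 months.
require 'date'

today = Date.today
dates = (0...30).map { |i| today + i } +
        (30..180).step(7).map { |i| today + i }
```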