# Data Mining Wrap-up: Cheap Airline Tickets

I never got around to summarizing the results of my data mining team’s research project. If you never read the original post (it was one of my first), you might want to start there.

# Discussion

As we dug into our project more, we decided that we really wanted to answer three main questions:

1. What is the best day to purchase airline tickets?
2. How far in advance should we purchase tickets?
3. Are we getting the best price?

We continued using the Ruby Selenium driver for web scraping. We came up with a couple different ways to avoid getting blocked by Kayak.com, but none of our techniques worked flawlessly. It was sort of hit-and-miss, oddly enough. Every now and again one of our IPs would get blocked, so we’d have to come up with an alternative.

We ended up scraping roughly 500,000 rows of data. We used a couple different randomized subsets of around 10,000 to 50,000 rows to make data processing faster.
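Our actual preprocessing code isn’t shown here, but drawing a randomized subset like the ones we used can be sketched in a few lines of Ruby. The seed and helper name are just illustrative choices, not our original code:

```ruby
# Illustrative sketch: draw a reproducible random subset of scraped rows.
# A fixed seed makes the subset repeatable across runs.
def random_subset(rows, size, seed: 42)
  rows.sample([size, rows.length].min, random: Random.new(seed))
end

all_rows = (1..500_000).map { |i| { id: i } } # stand-in for scraped CSV rows
subset = random_subset(all_rows, 10_000)
puts subset.length
```

Working on a 10,000-row sample instead of the full half-million rows made iterating on the algorithms much faster.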

From the CSVs we downloaded, we whittled down the number of input variables for our data mining algorithms. These are the ones we ended up using:

| Name | Type | Description |
| --- | --- | --- |
| Airline | Text | Example: ‘AA’ is American Airlines |
| Arrive | Text | Arrival city as a three-letter code |
| Arrive Time | Integer | Flight arrival time in military hours |
| Class | Text | Class of flight (e.g. business, coach, premium, mixed & first) |
| Departure Day | Text | Day of the week a flight departs |
| Difference | Integer | Number of days between downloading the information and the departure date |
| Duration | Integer | Number of minutes in flight |
| Price | Integer | Total cost for flight |
| Stops | Integer | Total number of stops or layovers for a flight |

Our outcome variable was Price, since we were trying to predict how the other input variables affect price.

As far as algorithms go, we settled on Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN) and Linear Regression.
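To give a flavor of the simplest of the three, here is a minimal ordinary-least-squares fit in plain Ruby, using made-up toy numbers (not our data) for price against days before departure. This is a sketch of the technique, not our course code:

```ruby
# Ordinary least squares for a single predictor:
# slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope * x̄
def linear_fit(xs, ys)
  n = xs.length.to_f
  mx = xs.sum / n
  my = ys.sum / n
  slope = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } /
          xs.sum { |x| (x - mx)**2 }
  [slope, my - slope * mx]
end

# Toy data: prices rising as the departure date approaches
days_out = [60, 45, 30, 14, 7]
prices   = [220, 240, 280, 340, 400]
slope, intercept = linear_fit(days_out, prices)
puts format('price ~ %.1f + %.2f * days_out', intercept, slope)
```

On real data you would fit against all the input variables at once, which is where having too many of them started obfuscating our results.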

# Results

What is the best day to purchase airline tickets? In answering our first question, we discovered that Tuesday was the best day to buy tickets, followed by Wednesday and Thursday. Sunday is the worst day to buy tickets, followed by Saturday, Friday and Monday. The caveat is that this is not true for every airline. This is the general conclusion we drew from our data.

How far in advance should you purchase tickets? Our data seemed to indicate that 60 days before your flight is the best time to buy tickets. As you get closer to the date of your flight, prices go up; buy much earlier than that and prices also tend to be higher. Once again, this is not true for all airlines.

Are you getting the best price? As far as our research went, this was too hard to determine. We had too many variables, and it was outside the scope of our short research project.

# Recommendations

If you wanted to do something similar to this (for school, for research or for fun), I would recommend you use less data in your research. This might sound obvious and it probably is to some, but it wasn’t to us at the time. We had all the data we needed; however, we included far too many variables in our algorithms to really find them useful or draw correlations. Many variables = obfuscated results.

Isolate your variables by using fewer of them. Focus on one airline. Focus on one airport. We wanted to know about multiple airlines and airports, but to get better results we should have studied them separately.

We also considered adding a “Buy/Buy Not” categorical variable so that we could run other algorithms that would determine whether to buy a ticket or not. This was out of scope and would have taken significant extra effort.

Lastly, we thought it would be worthwhile to collect data over a longer period of time (at least a year if not more) to detect seasonal and cyclical trends in ticket prices.

# For Your Perusal

You can download the slides for our presentation; however, they offer limited description since they are strictly presentation slides (virtually no text). There are a couple images of some of the graphs that were produced. I’ve extracted most of the useful images and data.

I also managed to find a version of our web scraping script, which you can download and modify to your liking as well.

# Data Mining & Web Scraping

I’m taking a data mining course this year. It’s fascinating new stuff and very applicable with all this buzz about “big data” and “BI.” Our professor requires we do a research project, so I got together with two friends and we started brainstorming about what we wanted to mine. We talked about some of the things we’d like to do predictive analysis on, like stocks, bonds or other securities. We thought about March Madness predictions. Then we settled on what we named “The Last Minute Vacation.” We haven’t decided exactly what we want to predict yet, but we are gathering data on airline flights and prices. One idea was to try and predict the best time to buy tickets last minute, since there is a certain window of opportunity when tickets are really cheap. If this can be consistently predicted, then it could be cost-effective to wait ’til the last minute to buy tickets. We realize that much of this information is widely known; however, we thought it would be fun to analyze the data ourselves and see what other trends we might find.

We have since researched some basic web scraping methods and settled on a GUI-based Ruby gem called Selenium. Basically, you set up a web driver object in Ruby, point it to a browser client like Firefox or Chrome, and then tell it to get a website. We settled on Kayak because it makes things really simple.

First off, Kayak links are easy to hack. Actually, you don’t even have to hack them. They just spell it out for you. For example, if you want to fly from Salt Lake City to San Francisco on March 13, 2012 and return March 16, 2012, you can type in the following link: http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16. If you don’t want to do a round-trip flight, just lose the last date. If you want a different airport, change the three-letter airport code. That simple.
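That URL pattern is simple enough to build programmatically. Here’s a sketch in Ruby (the helper name is mine, and the pattern is just the one described above, as it worked at the time):

```ruby
require 'date'

# Build a Kayak flight-search URL from airport codes and dates.
# Omit the return date for a one-way search.
def kayak_url(from, to, depart, return_date = nil)
  url = "http://www.kayak.com/#/flights/#{from}-#{to}/#{depart.strftime('%Y-%m-%d')}"
  url += "/#{return_date.strftime('%Y-%m-%d')}" if return_date
  url
end

puts kayak_url('SLC', 'SFO', Date.new(2012, 3, 13), Date.new(2012, 3, 16))
# http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16
```

With a helper like this, looping over airports and date ranges is just string work.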

To make things even easier, Kayak actually lists a link to download a CSV file of all the flights available for your search query. We spent about 10 minutes trying to figure out how to best scrape the necessary data from each query and then stumbled upon this jewel. As you can imagine, this significantly simplified our task. Thank you Kayak.

The tricky part has been making sure Kayak doesn’t block us out. To try and tackle this we’ve added random wait times. We’ve also run into errors parsing the site for the CSV. Sometimes the site loads too slowly, or the link just isn’t there for whatever reason. Most of this was easily handled with Ruby begin-rescue clauses. Now we are just trying to automate the scraping from a server so we don’t have to manually run it. This should be easy enough, except that the server we are going to use is not GUI-based, so that’s a bit of a problem.
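The random-wait-plus-retry pattern can be sketched like this. The helper name and attempt counts are my own illustrative choices, and the flaky step is simulated here rather than an actual page load:

```ruby
# Retry a flaky step a few times, sleeping a random interval between
# attempts (the random wait also makes the scraper look less bot-like).
def with_retries(max_attempts: 3, max_wait: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    sleep(rand(0.0..max_wait))
    retry if attempts < max_attempts
    raise e
  end
end

calls = 0
result = with_retries(max_wait: 0.01) do
  calls += 1
  raise 'CSV link not found yet' if calls < 3 # simulate a slow page
  'downloaded'
end
puts result
```

In the real script the block would be the Selenium navigation and CSV download; `retry` re-runs the whole `begin` body, so a slow page just gets another chance after a pause.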

The script presently scrapes about 150 CSV files per run. We grab flight data for 30 consecutive days from the current date and then grab every 7th flight from there up to about 6 months out. Once we finish the scraping and run some analysis, I’ll post some of our findings if anything worthwhile shows up.
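Generating that departure-date schedule is a small bit of `Date` arithmetic. The exact offsets below are my reconstruction of the schedule described above (30 daily dates, then weekly out to roughly six months), not the literal script:

```ruby
require 'date'

# 30 consecutive days from the start date, then every 7th day
# out to roughly six months.
def departure_dates(from = Date.today, horizon_days: 180)
  daily  = (1..30).map { |d| from + d }
  weekly = (37..horizon_days).step(7).map { |d| from + d }
  daily + weekly
end

dates = departure_dates(Date.new(2012, 3, 1))
puts dates.length
```

Crossing a date list like this with a handful of routes gets you to roughly the 150 queries per run mentioned above.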