Data Mining Wrap-up: Cheap Airline Tickets

I never got around to summarizing the results of my data mining team’s research project. If you never read the original post (it was one of my first), you might want to read it first.

Discussion

As we dug into our project more, we decided that we really wanted to answer three main questions:

  1. What is the best day to purchase airline tickets?
  2. How far in advance should we purchase tickets?
  3. Are we getting the best price?

We continued using the Ruby Selenium driver for web scraping. We came up with a couple of different ways to avoid getting blocked by Kayak.com, but none of our techniques worked flawlessly. Oddly enough, it was hit-and-miss. Every now and again one of our IPs would get blocked, so we’d have to come up with an alternative.

We ended up scraping roughly 500,000 rows of data. To make data processing faster, we worked with a couple of different randomized subsets of around 10,000 to 50,000 rows each.
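For illustration, here is a minimal Python sketch of drawing a random subset from a large CSV in one pass. It is not the project’s actual code, and the post does not say how the subsets were drawn; reservoir sampling is an assumption, and `sample_rows`, the column names and the in-memory stand-in file are all hypothetical.

```python
import csv
import io
import random

def sample_rows(rows, k, seed=None):
    """Pick k rows uniformly at random from any iterable in a single pass
    (reservoir sampling), so the full file never has to sit in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = row     # replace with probability k / (i + 1)
    return reservoir

# Stand-in for a scraped CSV; the real files are not reproduced here.
fake_csv = io.StringIO(
    "Airline,Price\n" + "\n".join(f"AA,{100 + n}" for n in range(1000))
)
subset = sample_rows(csv.DictReader(fake_csv), k=50, seed=42)
print(len(subset))
```

Passing a fixed `seed` makes a subset reproducible, which helps when comparing algorithms on the same rows.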

From the CSVs we downloaded, we whittled down the number of input variables for our data mining algorithms. These are the ones we ended up using:

NAME           TYPE     DESCRIPTION
Airline        Text     Example: ‘AA’ is American Airlines
Arrive         Text     Arrival city as a three-letter code
Arrive Time    Integer  Flight arrival time in military hours
Class          Text     Class of flight (e.g. business, coach, premium, mixed & first)
Departure Day  Text     Day of the week a flight departs
Difference     Integer  Number of days between downloading the information and the departure date
Download Day   Text     Day of the week flight data was downloaded
Duration       Integer  Number of minutes in flight
Price          Integer  Total cost for flight
Stops                   Total number of stops or layovers for a flight
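As a rough sketch of how the three date-based variables relate to each other, the Python snippet below derives Download Day, Departure Day and Difference from two dates. The function name `derive_fields` and the example dates are made up for illustration; the post does not show how these columns were actually computed.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def derive_fields(downloaded, departs):
    """Build the date-derived columns for one scraped row."""
    return {
        "Download Day": DAY_NAMES[downloaded.weekday()],   # Monday is 0
        "Departure Day": DAY_NAMES[departs.weekday()],
        "Difference": (departs - downloaded).days,         # days until departure
    }

# Hypothetical example: data downloaded on 2 Jan 2024 for a 2 Mar 2024 flight.
print(derive_fields(date(2024, 1, 2), date(2024, 3, 2)))
```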

Our outcome variable was Price: we were trying to predict it from the other variables and to see how each of them affects it.

As for algorithms, we settled on Artificial Neural Networks (ANN), K Nearest Neighbor (KNN) and Linear Regression.
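To give a feel for the simplest of the three, here is a minimal KNN regression sketch in plain Python. It is an illustration only, not the tooling the project used: `knn_predict`, the choice of (Difference, Stops) as features and the toy training rows are all assumptions.

```python
import math

def knn_predict(train, query, k=3):
    """Predict a price as the mean price of the k nearest training rows.

    train: list of (features, price) pairs, features being numeric tuples
    query: numeric tuple with the same layout as the training features
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Made-up rows: (Difference in days, Stops) -> Price
toy_train = [
    ((1, 0), 100),
    ((2, 0), 110),
    ((30, 1), 300),
    ((31, 1), 310),
]
print(knn_predict(toy_train, (1.5, 0), k=2))
```

In practice the features would need scaling so that a wide-ranging column such as Duration does not dominate the distance, and the Text columns would need a numeric encoding first.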

Results

What is the best day to purchase airline tickets? In our data, Tuesday was the best day to buy tickets, followed by Wednesday and Thursday. Sunday was the worst day to buy tickets, followed by Saturday, Friday and Monday. The caveat is that this is a general conclusion drawn from all of our data; it is not true for every airline.

[Figure: Best Day To Buy Airline Tickets]
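A day-of-week ranking like this can be computed by averaging Price per Download Day. The sketch below assumes rows are dictionaries keyed by the variable names in the table; the helper names and the four sample rows are invented for illustration and are not the project’s data.

```python
from collections import defaultdict

def mean_price_by(rows, column):
    """Average Price for each distinct value of the given column."""
    totals = defaultdict(lambda: [0, 0])   # value -> [sum of prices, count]
    for row in rows:
        totals[row[column]][0] += row["Price"]
        totals[row[column]][1] += 1
    return {value: s / n for value, (s, n) in totals.items()}

def rank_days(rows):
    """Download days ordered from cheapest to most expensive on average."""
    means = mean_price_by(rows, "Download Day")
    return sorted(means, key=means.get)

# Made-up rows, only to show the shape of the input.
toy_rows = [
    {"Download Day": "Tuesday", "Price": 200},
    {"Download Day": "Tuesday", "Price": 220},
    {"Download Day": "Sunday", "Price": 300},
    {"Download Day": "Wednesday", "Price": 230},
]
print(rank_days(toy_rows))
```

Running the same ranking per airline would show where the general conclusion breaks down.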

How far in advance should you purchase tickets? Our data seemed to indicate that 60 days before your flight is the best time to buy tickets. As you get closer to the date of your flight, prices go up. Buying earlier than that also tends to cost more. Once again, this is not true for all airlines.

[Figure: When to Buy Airline Tickets]
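One way to look for such a sweet spot is to bucket rows by the Difference variable and compare a typical price per bucket. The sketch below uses weekly buckets and the median; both choices, the name `cheapest_lead_time` and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def cheapest_lead_time(rows, bucket_days=7):
    """Return (bucket_start, median_price) for the cheapest lead-time bucket.

    Rows are grouped by how many days before departure they were downloaded,
    in buckets of bucket_days days.
    """
    buckets = defaultdict(list)
    for row in rows:
        start = (row["Difference"] // bucket_days) * bucket_days
        buckets[start].append(row["Price"])
    medians = {start: median(prices) for start, prices in buckets.items()}
    best = min(medians, key=medians.get)
    return best, medians[best]

# Made-up rows, only to show the shape of the input.
toy_rows = [
    {"Difference": 5, "Price": 400},
    {"Difference": 6, "Price": 420},
    {"Difference": 58, "Price": 250},
    {"Difference": 60, "Price": 240},
    {"Difference": 62, "Price": 260},
    {"Difference": 120, "Price": 300},
    {"Difference": 121, "Price": 310},
]
print(cheapest_lead_time(toy_rows))
```

The median is used here because a handful of very expensive fares would otherwise pull a bucket’s average up.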

Are you getting the best price? As far as our research goes, this was too hard to determine. We had too many variables, and the question was outside the scope of our short research project.

Recommendations

If you want to do something similar to this (for school, for research or for fun), I would recommend you use less data in your research, and in particular fewer variables. This might sound obvious, and it probably is to some, but it wasn’t to us at the time. We had all the data we needed; however, we included far too many variables in our algorithms to get useful results or draw correlations. Many variables = obfuscated results.

Isolate the variable you care about by holding the others fixed. Focus on one airline. Focus on one airport. We wanted to know about multiple airlines and airports, but to get better results we should have studied them separately.
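In code, studying airlines and airports separately amounts to splitting or filtering the rows before running any analysis. A minimal sketch, assuming rows are dictionaries keyed by the variable names above; `split_by`, `hold_fixed` and the sample airline and airport codes are hypothetical.

```python
from collections import defaultdict

def split_by(rows, column):
    """Group rows by one column so each group can be analysed on its own."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)

def hold_fixed(rows, fixed):
    """Keep only rows whose columns match every value in `fixed`."""
    return [row for row in rows
            if all(row.get(col) == val for col, val in fixed.items())]

# Made-up rows, only to show the shape of the input.
toy_rows = [
    {"Airline": "AA", "Arrive": "LAX", "Price": 250},
    {"Airline": "AA", "Arrive": "JFK", "Price": 310},
    {"Airline": "DL", "Arrive": "LAX", "Price": 270},
]
print(sorted(split_by(toy_rows, "Airline")))
print(hold_fixed(toy_rows, {"Airline": "AA", "Arrive": "LAX"}))
```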

We also considered adding a “Buy/Buy Not” categorical variable so that we could run other algorithms that would determine whether to buy a ticket or not. This was out of scope and would have taken significant extra effort.
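For anyone who wants to try it, one possible way to derive such a label is sketched below. The labeling rule is an assumption, not something the project defined: an observation is “Buy” if no later observation of the same itinerary was cheaper. It also assumes a price history per itinerary, which is not one of the variables listed above; `label_buy` and the sample prices are hypothetical.

```python
def label_buy(prices, tolerance=0.0):
    """Label each price in a chronological history as "Buy" or "Buy Not".

    A price is "Buy" when it is no higher than the cheapest price seen later,
    allowing a relative tolerance (0.05 means within 5%).
    """
    labels = []
    cheapest_later = float("inf")
    # Walk backwards so the cheapest later price is always known.
    for price in reversed(prices):
        is_buy = price <= cheapest_later * (1 + tolerance)
        labels.append("Buy" if is_buy else "Buy Not")
        cheapest_later = min(cheapest_later, price)
    labels.reverse()
    return labels

# Made-up price history for one itinerary, oldest observation first.
print(label_buy([300, 250, 280, 260, 400]))
```

With labels like these, a classifier could be trained on the same input variables instead of a price regressor.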

Lastly, we thought it would be worthwhile to collect data over a longer period of time (at least a year if not more) to detect seasonal and cyclical trends in ticket prices.

For Your Perusal

You can download the slides for our presentation; however, they offer limited description since they are strictly presentation slides (virtually no text). They include a couple of images of the graphs we produced. I’ve extracted most of the useful images and data.

I also managed to find a version of our web scraping script, which you can download and modify to your liking as well.