Data Mining Wrap-up: Cheap Airline Tickets

I never got around to summarizing the results of my data mining team’s research project. If you never read the original post (it was one of my first), you might want to read it first.

Discussion

As we dug into our project more, we decided that we really wanted to answer three main questions:

  1. What is the best day to purchase airline tickets?
  2. How far in advance should we purchase tickets?
  3. Are we getting the best price?

We continued using the Ruby Selenium driver for web scraping. We came up with a couple of different ways to avoid getting blocked by Kayak.com, but none of our techniques worked flawlessly. Oddly enough, it was hit-and-miss. Every now and again one of our IPs would get blocked, so we’d have to come up with an alternative.

We ended up scraping roughly 500,000 rows of data. We used a couple of different randomized subsets of around 10,000 to 50,000 rows to make data processing faster.
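For anyone wanting to reproduce the subsetting step, here is a minimal sketch of drawing a randomized subset in Ruby. The helper name random_subset and the fixed seed are assumptions made for illustration, and the integer range merely stands in for the scraped rows; this is not the code our team ran.

```ruby
# Draw a random subset without replacement (hypothetical helper).
# A fixed seed makes the same subset come back on every run.
def random_subset(rows, size, seed: 42)
  rows.sample(size, random: Random.new(seed))
end

# Row numbers stand in for the roughly 500,000 scraped rows.
row_ids = (1..500_000).to_a
subset = random_subset(row_ids, 10_000)

puts subset.length
```

Seeding the generator is a design choice: it lets two algorithms be compared on exactly the same rows.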

From the CSVs we downloaded, we whittled down the number of input variables for our data mining algorithms. These are the ones we ended up using:

NAME            TYPE      DESCRIPTION
Airline         Text      Example: ‘AA’ is American Airlines
Arrive          Text      Arrival city as a three-letter code
Arrive Time     Integer   Flight arrival time in military hours
Class           Text      Class of flight (e.g. business, coach, premium, mixed & first)
Departure Day   Text      Day of the week a flight departs
Difference      Integer   Number of days between downloading the information and the departure date
Download Day    Text      Day of the week flight data was downloaded
Duration        Integer   Number of minutes in flight
Price           Integer   Total cost for flight
Stops                     Total number of stops or layovers for a flight

Our outcome variable was Price: we were trying to predict it from the other variables and to see how each of them affects it.

As far as algorithms go, we settled on Artificial Neural Networks (ANN), K Nearest Neighbor (KNN) and Linear Regression.
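To give a feel for what KNN does with variables like these, here is a minimal sketch of nearest-neighbor price prediction in plain Ruby. It is not the tool or code our team used; the function names are hypothetical, the rows are invented for the example, and a real run would scale the features first so that one variable does not dominate the distance.

```ruby
# Euclidean distance between two equal-length numeric feature vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Predict a price as the mean price of the k closest training rows.
def knn_predict(training, query, k: 3)
  nearest = training.min_by(k) { |features, _price| distance(features, query) }
  nearest.sum { |_features, price| price } / nearest.length.to_f
end

# Invented rows: [[difference_in_days, duration_minutes, stops], price]
training = [
  [[60, 120, 0], 200],
  [[58, 125, 0], 210],
  [[55, 130, 0], 220],
  [[3, 120, 0], 450],
  [[2, 300, 1], 400]
]

puts knn_predict(training, [59, 122, 0], k: 3)
```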

Results

What is the best day to purchase airline tickets? In answering our first question, we discovered that Tuesday was the best day to buy tickets, followed by Wednesday and Thursday. Sunday is the worst day to buy tickets, followed by Saturday, Friday and Monday. The caveat is that this is not true for every airline. This is the general conclusion we drew from our data.

[Figure: Best Day To Buy Airline Tickets]
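A comparison like this boils down to grouping rows by one variable and averaging Price within each group. The sketch below shows that idea; average_price_by is a hypothetical helper, and the rows are placeholders invented for the example, not the project’s data.

```ruby
# Average price per value of a grouping key, cheapest group first.
def average_price_by(rows, key)
  rows.group_by { |row| row[key] }
      .map { |value, group| [value, group.sum { |r| r[:price] } / group.length.to_f] }
      .sort_by { |_value, average| average }
end

# Placeholder rows (invented values, not scraped data).
rows = [
  { download_day: "Tuesday", price: 300 },
  { download_day: "Tuesday", price: 320 },
  { download_day: "Sunday",  price: 350 },
  { download_day: "Sunday",  price: 370 }
]

average_price_by(rows, :download_day).each do |day, average|
  puts "#{day}: #{average}"
end
```

Passing a different key, such as one holding the Difference variable, would give the average price by number of days before departure.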

How far in advance should you purchase tickets? Our data seemed to indicate that 60 days before your flight is the best time to buy tickets. As you get closer to the date of your flight, prices go up. If you buy earlier than that, prices also tend to be higher. Once again, this is not true for all airlines.

[Figure: When to Buy Airline Tickets]

Are you getting the best price? As far as our research goes, this was too hard to determine. We had too many variables, and it was outside the scope of our short research project.

Recommendations

If you wanted to do something similar to this (for school, for research or for fun), I would recommend you use less data in your research. This might sound obvious and it probably is to some, but it wasn’t to us at the time. We had all the data we needed; however, we included far too many variables in our algorithms to really find them useful or draw correlations. Many variables = obfuscated results.

Isolate the variables by using fewer variables. Focus on one airline. Focus on one airport. We wanted to know about multiple airlines and airports, but to get better results we should have studied them separately.

We also considered adding a “Buy/Buy Not” categorical variable so that we could run other algorithms that would determine whether to buy a ticket or not. This was out of scope and would have taken significant extra effort.

Lastly, we thought it would be worthwhile to collect data over a longer period of time (at least a year if not more) to detect seasonal and cyclical trends in ticket prices.

For Your Perusal

You can download the slides for our presentation; however, they offer limited description since they are strictly presentation slides (virtually no text). There are a couple images of some of the graphs that were produced. I’ve extracted most of the useful images and data.

I also managed to find a version of our web scraping script, which you can download and modify to your liking as well.

Bash, Ruby & Time Calculations

As a follow-up to Clock In, Clock Out, I thought I would discuss the methods I used for calculating the number of hours worked in my clock script, since working with time mathematically is always a bit tricky.

First off, Bash shell scripts are really great for text and file manipulation (among other powerful aspects). However, once you start getting into more complex functionality like math and anything object-oriented, you really want to switch over to another language or environment. That’s why I called upon Ruby to do my “heavy lifting” as I put it.

So let’s jump into the first timestamp the script records. Pretty basic.

TIME_IN=`date "+%Y/%m/%d %H:%M:%S"`

The backticks (`) allow me to store the result of the date command in the variable TIME_IN. (Actually, the backticks are sorta deprecated. The new convention is to use $(<command>) instead. Bad habit I suppose.) I store it in a variable so that I can use it later on to print the timestamp to Terminal (below).

echo -e "\033[1;32mStatus - Clocked in\033[0m"
echo -e "\033[0;37m$TIME_IN\033[0m"

Those funky bits of code (i.e. \033[0;37m$TIME_IN\033[0m ) are used to adjust the text color when printed in Terminal. In this case, the first line is green and the second line gray.
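The same escape sequences work from Ruby, where "\e" is the escape character that Bash writes as \033. Here is a small sketch; colorize is a hypothetical helper, not part of clock.

```ruby
# Wrap text in an ANSI color sequence and reset the color afterwards.
# code is the part between "[" and "m", e.g. "1;32" for bold green.
def colorize(text, code)
  "\e[#{code}m#{text}\e[0m"
end

puts colorize("Status - Clocked in", "1;32")
puts colorize(Time.now.strftime("%Y/%m/%d %H:%M:%S"), "0;37")
```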

Now when you clock out, clock grabs the last line in the timecard, stores the new timestamp and feeds both into the timediff.rb script as follows:

TIME_IN=`tail -n 1 $TIMECARD`
TIME_OUT="Out - `date "+%Y/%m/%d %H:%M:%S"`"
...
HOURS="     Session Length - `timediff.rb "$TIME_IN" "$TIME_OUT"`"

You might have noticed that the TIME_OUT variable not only includes the timestamp from date but also some extra text, specifically “Out - ”. Later on, clock uses the TIME_OUT variable to print to the console. Time.parse in timediff.rb is lenient enough that this extra text is skipped and disregarded in the time difference calculation. Let’s take a look at what happens in timediff.rb.

First I require the necessary libraries: rubygems, the standard library’s time (which adds Time.parse) and the time_diff gem.

require 'rubygems'
require 'time'
require 'time_diff'

Next, timediff.rb uses the parse method to create a Time object with the appropriate date and time.

t1 = Time.parse(ARGV[0])
t2 = Time.parse(ARGV[1])

ARGV is a built-in array in Ruby that holds the arguments passed to the script when it is run, in this case $TIME_IN and $TIME_OUT from clock. Lastly, the script uses the diff method added to Time by the time_diff gem to find the difference between the two timestamps, which is subsequently sent back to clock with puts.

puts Time.diff(t1, t2, '%h:%m:%s')[:diff]

Since the output of diff() is actually a hash (see hashes in Ruby), the [:diff] call at the end of the line tells Ruby to only pass back the :diff portion of the hash with the timestamp formatted as ‘%h:%m:%s’.
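If you would rather not depend on the time_diff gem or on lenient parsing, the same calculation can be sketched with the standard library alone. This is an alternative under stated assumptions, not what timediff.rb does: the helper names are hypothetical, and the “In - ” label on the first line is a guess at what a clock-in line looks like.

```ruby
require "time"

# Matches timestamps written by date "+%Y/%m/%d %H:%M:%S".
TIMESTAMP_PATTERN = %r{\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}}

# Pull the timestamp out of a line, ignoring any label around it.
def extract_time(line)
  stamp = line[TIMESTAMP_PATTERN]
  raise ArgumentError, "no timestamp in #{line.inspect}" if stamp.nil?

  Time.strptime(stamp, "%Y/%m/%d %H:%M:%S")
end

# Format a number of seconds as H:MM:SS.
def format_duration(seconds)
  total = seconds.round
  format("%d:%02d:%02d", total / 3600, (total % 3600) / 60, total % 60)
end

t1 = extract_time("In - 2012/03/13 09:00:00")   # assumed clock-in line
t2 = extract_time("Out - 2012/03/13 17:30:15")
puts format_duration(t2 - t1)
```

Subtracting one Time from another yields the difference in seconds as a Float, which is why format_duration rounds before splitting it into hours, minutes and seconds.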

Now let’s talk about timeadd.rb. In essence, clock grabs the hours for each recorded session and pumps each one into timeadd.rb as an argument.

TOTAL=`grep "Session Length -" $TIMECARD | cut -c23-30`
...
echo "Total Hours - `timeadd.rb $TOTAL | cut -d . -f 1`" >> $TIMECARD
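For comparison, the grep and cut pipeline can be sketched in Ruby. cut -c23-30 keeps characters 23 through 30 of each line, which is index 22 with a length of 8 in Ruby’s zero-based indexing. The helper name and the sample timecard lines are assumptions made for illustration.

```ruby
# Collect the duration from every "Session Length -" line of a timecard.
def session_lengths(lines)
  lines.grep(/Session Length -/).map { |line| line[22, 8].to_s.strip }
end

# Sample lines (assumed layout; five leading spaces as in the HOURS variable).
timecard = [
  "In - 2012/03/13 09:00:00",
  "Out - 2012/03/13 17:30:15",
  "     Session Length - 08:30:15"
]

puts session_lengths(timecard).inspect
```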

First, timeadd.rb creates a base timestamp of the current day at 00:00:00 and a totalTimeInSeconds variable. The rest of that portion is explained in the comments of the code below.

baseTime = Time.parse('00:00:00')
totalTimeInSeconds = 0.0

ARGV.each do |arg|

   # Parses the hours into a Time for today
   time = Time.parse(arg)

   # Calculates the time difference in seconds using the base time 00:00:00
   timeInSeconds = time - baseTime

   # Sums up the time in seconds
   totalTimeInSeconds += timeInSeconds

end

To calculate the total time worked for that period, I implemented a nice little gem called ChronicDuration. You can read more about it in a blog post by Everyday Rails. In short, ChronicDuration takes a time in seconds and converts it to a given format, which in this case is day:hours:minutes:seconds as indicated by :chrono. After that I pass it back to clock.

totalTime = ChronicDuration::output(totalTimeInSeconds, :format => :chrono)
puts "#{totalTime}"
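As an alternative sketch, the summing can also be done without Time.parse or ChronicDuration by splitting each session on the colons. This is not what timeadd.rb does, the helper names are hypothetical, and the output is plain H:MM:SS rather than ChronicDuration’s :chrono format.

```ruby
# Convert "H:M:S" into a number of seconds.
def duration_to_seconds(text)
  hours, minutes, seconds = text.split(":").map { |part| Integer(part, 10) }
  hours * 3600 + minutes * 60 + seconds
end

# Format seconds as H:MM:SS; hours are allowed to exceed 24.
def seconds_to_duration(total)
  format("%d:%02d:%02d", total / 3600, (total % 3600) / 60, total % 60)
end

sessions = ["08:30:15", "07:45:00", "09:10:50"]  # sample session lengths
total = sessions.sum { |session| duration_to_seconds(session) }
puts seconds_to_duration(total)
```

Working on the numbers directly means a session does not have to fit inside a single clock day, which parsing it as a time of day requires.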

Finally, clock does some final formatting on the timecard as explained in the previous post (download the files there too).

While Bash can handle basic math by converting time into seconds, I chose to take the Ruby route to learn some Ruby and keep the code in clock a bit more simple. Arguably, abstracting to external Ruby scripts is just as complex. However, it did simplify much of the work and I didn’t have to do a lot of tedious math and formatting in Bash.

Data Mining & Web Scraping

I’m taking a data mining course this year. It’s fascinating new stuff and very applicable with all this buzz about “big data” and “BI.” Our professor requires we do a research project. So I got together with two friends and we started brainstorming about what we wanted to mine. We talked about some of the things we’d like to do predictive analysis on, like stocks, bonds or other securities. We thought about March Madness predictions. Then we settled on what we named “The Last Minute Vacation.” We haven’t decided exactly what we want to predict yet, but we are gathering data on airline flights and prices. One idea was to try and predict the best time to buy tickets last minute, since there is a certain window of opportunity when tickets are really cheap. If this can be consistently predicted, then it could be cost efficient to wait ’til the last minute to buy tickets. We realize that much of this information is widely known; however, we thought it would be fun to analyze the data ourselves and see what other trends we might find.

We have since researched some basic web scraping methods and settled on a GUI-based Ruby gem called Selenium. Basically, you set up a web driver object in Ruby, point it to a browser client like Firefox or Chrome and then tell it to get a website. We settled on Kayak because it makes things really simple.

First off, Kayak links are easy to hack. Actually, you don’t even have to hack them. They just spell it out for you. For example, if you want to fly from Salt Lake City to San Francisco on March 13, 2012 and return March 16, 2012, you can type in the following link: http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16. If you don’t want to do a round-trip flight, just lose the last date. If you want a different airport, change the three-letter airport code. That simple.
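Building those links in Ruby takes only a few lines. This is a minimal sketch of the link pattern described above; kayak_url and KAYAK_BASE are hypothetical names, and the pattern is the one Kayak used at the time of writing, so it may not hold later.

```ruby
require "date"

KAYAK_BASE = "http://www.kayak.com/#/flights"

# Build a search link; leave out return_date for a one-way flight.
def kayak_url(origin, destination, depart, return_date = nil)
  parts = ["#{origin}-#{destination}", depart.iso8601]
  parts << return_date.iso8601 if return_date
  "#{KAYAK_BASE}/#{parts.join('/')}"
end

puts kayak_url("SLC", "SFO", Date.new(2012, 3, 13), Date.new(2012, 3, 16))
puts kayak_url("SLC", "SFO", Date.new(2012, 3, 13))
```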

To make things even easier, Kayak actually lists a link to download a CSV file of all the flights available for your search query. We spent about 10 minutes trying to figure out how to best scrape the necessary data from each query and then stumbled upon this jewel. As you can imagine, this significantly simplified our task. Thank you Kayak.

The tricky part has been making sure Kayak doesn’t block us out. To try and tackle this we’ve added random wait times. We’ve also run into errors parsing the site for the CSV. Sometimes the site loads too slowly or the link just isn’t there for whatever reason. Most of this was easily handled with Ruby begin-rescue clauses. Now we are just trying to automate the scraping from a server so we don’t have to manually run it. This should be easy enough, except that the server we are going to use is not GUI-based, so that’s a bit of a problem.
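The random waits and begin-rescue handling fit together roughly like this. It is a simplified sketch rather than our actual script: with_retries, its parameters and the simulated failure are assumptions, and the block stands in for whatever Selenium calls fetch the CSV link.

```ruby
# Run the block, pausing for a random time and retrying when it raises.
# The last failure is re-raised once the attempts are used up.
def with_retries(attempts: 3, wait_range: 1.0..5.0, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleeper.call(rand(wait_range)) # random wait before the next try
    retry
  end
end

calls = 0
result = with_retries(attempts: 3, wait_range: 0.0..0.01) do |attempt|
  calls += 1
  raise "CSV link not found" if attempt < 3 # simulate a page that loads slowly
  "flights.csv"
end

puts "#{result} after #{calls} attempts"
```

The sleeper parameter exists only so the pause can be swapped out when trying the helper without actually waiting.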

The script presently scrapes about 150 CSV files per run. We grab flight data for 30 consecutive days from the current date and then grab every 7th day from there up to about 6 months out. Once we finish the scraping and run some analysis, I’ll post some of our findings if anything worthwhile shows up.
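The date schedule can be sketched as follows. The function name and parameters are hypothetical, and two details are assumptions: that the 30 consecutive days start with today, and that “about 6 months” means 180 days.

```ruby
require "date"

# Departure dates to query: every day for the first daily_days days,
# then every step-th day after that, up to horizon_days from today.
def departure_dates(today, daily_days: 30, step: 7, horizon_days: 180)
  daily = (0...daily_days).map { |offset| today + offset }
  first_weekly = daily_days - 1 + step
  weekly = first_weekly.step(horizon_days, step).map { |offset| today + offset }
  daily + weekly
end

dates = departure_dates(Date.new(2012, 3, 1))
puts dates.length
puts dates.first
puts dates.last
```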

Clock In, Clock Out

I’ve worked (and presently work) a number of jobs where I have to keep track of my own hours. As with any repetitive task, the answer is to automate. Sure, tracking your hours isn’t hard or time consuming, but having your computer do it for you while learning a little scripting appeals to me more. Thus, I concocted a nice little Bash shell script with a pinch of Ruby for the heavy lifting.

The script/program is simple. It’s called clock. All you do is type clock into the Terminal (Mac) and…voilà! You’re clocked in. When you want to clock out, type clock again and you’re clocked out. Each time you clock in or out, the program records a timestamp with the date and time. After each clock out, it records the number of hours worked during that session. When it’s time to submit your timecard so you can receive your paycheck, simply type clock -f to finalize your timecard. This will sum your session hours and rename the timecard with the start and end date so you know the period.

I’ve included a zip archive of the files (clock, timeadd.rb, timediff.rb) so if you find yourself in the same boat you can give the program a shot. I suggest you add them to your own “bin” folder in your home directory like I’ve done. If you aren’t familiar with this concept, let me give a short explanation.

Dedicating a directory in your home folder is convenient if you like to write various little programs/scripts. It allows you to…segregate these from the built-in programs/scripts that come with your OS so they are easily accessible and you remember which ones are yours. To make these programs work just like the rest of the programs (e.g. ls, rm, less, pwd, etc.), add the directory to your $PATH as follows:

export PATH=$PATH:/Users/USERNAME/bin

Of course, be sure to replace USERNAME with the obvious. You can type this command into the CLI or you can just add it to your .profile file in your home directory. If you don’t have a .profile file, go ahead and create one. Every time you open Terminal (or another CLI), the .profile file is read and executed, thus you will never have to type in those commands again manually.

Two more things to make this work:

  1. Make sure to open clock in a text editor and change the path where the timecard-current.txt will be saved (in addition to any other paths).
  2. Make sure to install the appropriate gems for the Ruby scripts. Simply type the following into Terminal:
    gem install chronic_duration time_diff

If you have any thoughts, suggestions or questions, please comment. I value feedback.