# Data Mining Wrap-up: Cheap Airline Tickets

I never got around to summarizing the results of my data mining team’s research project. If you never read the original post (it was one of my first), you might want to read it first.

# Discussion

As we dug into our project more, we decided that we really wanted to answer three main questions:

1. What is the best day to purchase airline tickets?
2. How far in advance should we purchase tickets?
3. Are we getting the best price?

We continued using the Ruby Selenium driver for web scraping. We came up with a couple of different ways to avoid getting blocked by Kayak.com, but none of our techniques worked flawlessly. Oddly enough, it was hit-and-miss: every now and again one of our IPs would get blocked, so we’d have to come up with an alternative.

We ended up scraping roughly 500,000 rows of data. We used a couple different randomized subsets of around 10,000 to 50,000 rows to make data processing faster.
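
Drawing a subset like that only takes a few lines of Ruby. This is a minimal sketch rather than the code we actually ran: the `sample_rows` helper, the seed and the stand-in rows are all assumptions for illustration.

```ruby
# Hypothetical helper: returns a reproducible random subset of rows.
# A fixed seed means the same subset comes back on every run.
def sample_rows(rows, size, seed: 42)
  rows.sample(size, random: Random.new(seed))
end

# Stand-in rows; real rows would come from the scraped CSV files.
rows = (1..500).map { |i| { 'id' => i } }

subset = sample_rows(rows, 50)
puts subset.length
```

Fixing the seed is a design choice: it lets you rerun an algorithm on exactly the same subset while you tweak other settings.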

From the CSVs we downloaded, we whittled down the number of input variables for our data mining algorithms. These are the ones we ended up using:

| NAME | TYPE | DESCRIPTION |
| --- | --- | --- |
| Airline | Text | Example: ‘AA’ is American Airlines |
| Arrive | Text | Arrival city as a three letter code |
| Arrive Time | Integer | Flight arrival time in military hours |
| Class | Text | Class of flight (e.g. business, coach, premium, mixed & first) |
| Departure Day | Text | Day of the week a flight departs |
| Duration | Integer | Number of minutes in flight |
| Price | Integer | Total cost for flight |
| Stops | | Total number of stops or layovers for a flight |

Our outcome variable was Price: we wanted to predict it from the other input variables and see how each of them affects it.

As for algorithms, we settled on Artificial Neural Networks (ANN), K Nearest Neighbor (KNN) and Linear Regression.
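
To give a feel for the simplest of the three, here is a rough KNN sketch in plain Ruby. It is not what we ran for the project: the `knn_predict` helper, the choice of features (duration and stops) and the rows themselves are made up for illustration.

```ruby
# Euclidean distance between two numeric feature vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Hypothetical KNN regressor: predicts a price as the mean price of the
# k training rows whose features are closest to the query.
def knn_predict(training, query, k)
  nearest = training.min_by(k) { |row| distance(row[:features], query) }
  nearest.sum { |row| row[:price] } / nearest.length.to_f
end

# Made-up rows for illustration only: features are [duration in minutes, stops].
training = [
  { features: [120, 0], price: 200 },
  { features: [130, 0], price: 220 },
  { features: [300, 1], price: 150 },
  { features: [320, 2], price: 140 }
]

puts knn_predict(training, [125, 0], 2)
```

In practice you would probably want to scale the features first, since a raw duration in minutes dwarfs a stop count in the distance calculation.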

# Results

What is the best day to purchase airline tickets? In answering our first question, we discovered that Tuesday was the best day to buy tickets, followed by Wednesday and Thursday. Sunday is the worst day to buy tickets, followed by Saturday, Friday and Monday. The caveat is that this is not true for every airline. This is the general conclusion we drew from our data.

How far in advance should you purchase tickets? Our data seemed to indicate that 60 days before your flight is the best time to buy tickets. When you start getting closer to the date of your flight, prices go up. When you buy your tickets earlier than that, prices also tend to be higher. Once again, this is not true for all airlines.
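
Both of these questions boil down to averaging price over some grouping. A minimal sketch of that idea follows; the `average_price_by` helper and the `purchase_day` and `days_ahead` fields are hypothetical names, and the rows are made-up numbers that only exercise the code (they are not our results).

```ruby
# Hypothetical helper: average price per value of a grouping key.
def average_price_by(rows, key)
  rows.group_by { |row| row[key] }
      .transform_values { |group| group.sum { |r| r[:price] } / group.length.to_f }
end

# Made-up rows purely to exercise the code; these are not our results.
rows = [
  { purchase_day: 'Tuesday', days_ahead: 60, price: 200 },
  { purchase_day: 'Tuesday', days_ahead: 10, price: 300 },
  { purchase_day: 'Sunday',  days_ahead: 60, price: 260 }
]

by_day = average_price_by(rows, :purchase_day)
cheapest_day, = by_day.min_by { |_day, avg| avg }
puts cheapest_day
```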

Are you getting the best price? As far as our research goes, this was too hard to determine. We had too many variables and it was outside the scope of our short research project.

# Recommendations

If you wanted to do something similar to this (for school, for research or for fun), I would recommend you use less data in your research. This might sound obvious and it probably is to some, but it wasn’t to us at the time. We had all the data we needed; however, we included far too many variables in our algorithms to really find them useful or draw correlations. Many variables = obfuscated results.

Isolate the variables by using fewer variables. Focus on one airline. Focus on one airport. We wanted to know about multiple airlines and airports, but to get better results we should have studied them separately.

We also considered adding a “Buy/Buy Not” categorical variable so that we could run other algorithms that would determine whether to buy a ticket or not. This was out of scope and would have taken significant extra effort.
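
We never built this, so the following is only one way such a label could be derived. The rule used here ("Buy" when a fare is at or below the median) is purely an assumption, as are the helper names and the prices.

```ruby
# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical labeling rule: "Buy" when a fare is at or below the median.
def label_rows(rows)
  cutoff = median(rows.map { |row| row[:price] })
  rows.map { |row| row.merge(label: row[:price] <= cutoff ? 'Buy' : 'Buy Not') }
end

# Made-up prices for illustration only.
rows = [{ price: 180 }, { price: 220 }, { price: 310 }]
label_rows(rows).each { |row| puts "#{row[:price]}: #{row[:label]}" }
```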

Lastly, we thought it would be worthwhile to collect data over a longer period of time (at least a year if not more) to detect seasonal and cyclical trends in ticket prices.

You can download the slides for our presentation; however, they offer limited description since they are strictly presentation slides (virtually no text). There are a couple images of some of the graphs that were produced. I’ve extracted most of the useful images and data.

I also managed to find a version of our web scraping script, which you can download and modify to your liking as well.

# Lock Your Screen on Mac

The Windows shortcut for locking the screen is pretty convenient (Windows Key + L). When you’re at work or in a more public setting, it’s nice to be able to lock your screen quickly before leaving your computer temporarily. In my searches to find a convenient way on Mac, I found a number of ways, many of which involve extra applications and menus in the menubar. I’ll briefly mention a few ways that may interest you, then mention the method I settled on.

One way is to add the Keychain Access menu to your menubar. To do this, open Keychain Access (Cmd + Spacebar >> type Keych >> Enter) then open preferences (Cmd + ,) and check the first option to “Show keychain status in menubar.” Now you can lock the screen from the Keychain Access menu.

Another way is to enable the Fast User Switching menu. Do this by accessing System Preferences >> Users & Groups >> Login Options and then checking the option to “Show fast user switching menu as…” This will add a menu next to Spotlight and the clock allowing you to show the login window.

While both of these methods are effective, they involve extra clutter in your menubar. So I’ve discovered two other easy ways that involve shortcuts. The first method uses the Screensaver Shortcut. To make this work, you’ll have to change your security settings such that your computer requires a password immediately after the screensaver starts. This can be done by navigating to System Preferences >> Security & Privacy. Check the first box and set the menu to “immediately”. Now all you have to do is press Ctrl + Shift + Eject. If you’ve done it correctly, you should be asked for a password as soon as you exit the screensaver by moving the cursor or pressing a key.

I prefer to have my screensaver come on relatively quickly (10-15 minutes) and I don’t like it when my computer asks me for a password immediately, so I went another route using Automator. It turns out you can control Fast User Switching from the command line using a binary called CGSession. CGSession takes two options (that I know of), namely -suspend and -switchToUserID <value>. Simply open Automator (Cmd + Spacebar >> type Autom >> Enter) and under the Utilities section find “Run a shell script” and drag it to the section on the right. Now copy and paste the following script into the field where it says “cat” by default:

/System/Library/CoreServices/Menu\ Extras/User.menu/Contents/Resources/CGSession -suspend

Be sure to change the menu for “Service receives” to “no input”. Now save this simple one step workflow as a service and call it something like “Lock Screen”. Now we are going to assign this service a global hotkey/shortcut. Navigate to System Preferences >> Keyboard >> Keyboard Shortcuts >> Services. At the very bottom, you should see a general service called “Lock Screen”. Make sure the box is checked and when you hover over it you should see a little button that says “Add Shortcut”. Click that and add a shortcut like Cmd + Alt/Opt + L. Now you should be able to lock your screen globally at any time.

Alternatively, you could try using Quicksilver, a simple app that allows you to create powerful global shortcuts and the like. I used this before creating a service through Automator; however, I found it a bit cumbersome to use and set up. You shouldn’t need an extra app for something so simple.

Which way do you like best?

# Show/Hide Invisible Files on Mac & Other Secrets

Showing hidden files can be somewhat of a pain in Mac OS X. I’ve come up with a number of ways to make it easier for myself.

The simplest, most direct way of doing this is a command in the Terminal. One way is to navigate to the appropriate directory in Terminal and type ls -lha. Alternatively, you can type ls -lha <path/to/directory>. This will give you a list of all the files in that directory, hidden ones included. But maybe you aren’t proficient with Terminal commands, you need to do some file manipulation, or you just don’t want to use the CLI.

Much like my post Hide Desktop Icons on Mac, I’ve created a little script I call hidden that automatically shows and hides hidden files.

#!/bin/bash

# checks file visibility and stores value in a variable
isVisible="\$(defaults read com.apple.Finder AppleShowAllFiles)"

# toggle file visibility based on variable
if [ "\$isVisible" = 1 ]; then
  defaults write com.apple.Finder AppleShowAllFiles -bool false
else
  defaults write com.apple.Finder AppleShowAllFiles -bool true
fi

# force changes by restarting Finder
killall Finder

Paste that code into a text editor, save it to a directory in your \$PATH and make it executable (chmod 755 <filename>).

Two other alternatives I’ve stumbled upon are HideSwitch and Secrets Prefpane (at the time of this post the site was down, so get it at MacUpdate instead). HideSwitch is just a simple mini-app with two buttons to hide and show hidden files. Secrets Prefpane is just that: a button that shows up as a system preference and turns into a prefpane once clicked. However, Secrets is quite powerful and can do a lot more than just show and hide files. Secrets includes a variety of features, such as:

• Selecting the format and destination folder of saved screenshots
• Changing the login window desktop picture
• Changing Dock effects
• Seeing the contents of folders when QuickLooking (I don’t think this works on Lion)
• Enabling the debug menu in iCal

Those are just a few that I’ve found useful and interesting. Since it’s free, you may want to download it and check it out. Might have a feature you’ve been dying to have. Secrets also taps into many of the preferences of your other programs such as Adium, iTunes, Cyberduck, Skype, Preview, Transmission, etc.

# Hide Desktop Icons on Mac

Ever wanted to hide your desktop icons briefly and easily on Mac OS X for a presentation, screencast or just to hide everyday clutter? Here’s a simple Bash script you can use.

#!/bin/bash

# checks visibility and stores value in a variable
isVisible="\$(defaults read com.apple.finder CreateDesktop)"

# toggle desktop icon visibility based on variable
if [ "\$isVisible" = 1 ]; then
  defaults write com.apple.finder CreateDesktop -bool false
else
  defaults write com.apple.finder CreateDesktop -bool true
fi

# force changes by restarting Finder
killall Finder

Paste that into a text editor and save it without a suffix/filetype as something like desktop. Then execute the following command while in the folder where you saved the script (preferably in your personal bin directory): chmod 755 desktop. You should be good to go as long as the directory you saved it in is mapped to your \$PATH (If you aren’t sure what that means, read the 4th and 5th paragraphs of Clock In, Clock Out).

You can find some simple apps to do this if you are repelled by the Terminal or love extra menubar buttons; just search Google. Some of them cost money (CamouFlage – \$1.99), though I don’t know who would pay for something so simple. There are some other free alternatives.

# Bash, Ruby & Time Calculations

As a follow-up to Clock In, Clock Out, I thought I would discuss the methods I used for calculating the number of hours worked in my clock script, since working with time mathematically is always a bit tricky.

First off, Bash shell scripts are really great for text and file manipulation (among other powerful aspects). However, once you start getting into more complex functionality like math and anything object-oriented, you really want to switch over to another language or environment. That’s why I called upon Ruby to do my “heavy lifting” as I put it.

So let’s jump into the first timestamp the script records. Pretty basic.

TIME_IN=`date "+%Y/%m/%d %H:%M:%S"`

The backticks (`) allow me to store the result of the date command in the variable TIME_IN. (Actually, the backticks are sorta deprecated. The new convention is to use \$(<command>) instead. Bad habit I suppose.) I store it in a variable so that I can use it later on to print the timestamp to Terminal (below).

echo -e "\033[1;32mStatus - Clocked in\033[0m"
echo -e "\033[0;37m\$TIME_IN\033[0m"

Those funky bits of code (i.e. \033[0;37m\$TIME_IN\033[0m ) are used to adjust the text color when printed in Terminal. In this case, the first line is green and the second line gray.

Now when you clock out, clock grabs the last line in the timecard, stores the new timestamp and feeds both into the timediff.rb script as follows:

TIME_IN=`tail -n 1 \$TIMECARD`
TIME_OUT="Out - `date "+%Y/%m/%d %H:%M:%S"`"
...
HOURS="     Session Length - `timediff.rb "\$TIME_IN" "\$TIME_OUT"`"

You might have noticed that the TIME_OUT variable not only includes the timestamp from date but also some extra text, specifically “Out - ”. Later on, clock uses the TIME_OUT variable to print to the console. timediff.rb is robust enough that this extra text is simply disregarded in the time difference calculation. Let’s take a look at what happens in timediff.rb.

First I require the necessary libraries.

require 'rubygems'
require 'time'
require 'time_diff'

Next, timediff.rb uses the parse method to create a Time object with the appropriate date and time.

t1 = Time.parse(ARGV[0])
t2 = Time.parse(ARGV[1])

ARGV is a built-in array in Ruby that holds the arguments passed into the script when it is run, in this case \$TIME_IN and \$TIME_OUT from clock. Lastly, the script uses the diff method that the time_diff gem adds to Time to find the difference between the two timestamps, which is subsequently sent back to clock with puts.

puts Time.diff(t1, t2, '%h:%m:%s')[:diff]

Since the output of diff() is actually a hash (see hashes in Ruby), the [:diff] call at the end of the line tells Ruby to only pass back the :diff portion of the hash with the timestamp formatted as ‘%h:%m:%s’.
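
For comparison, the same calculation can be sketched with only Ruby’s standard library. This is not how the time_diff gem works internally; the `parse_stamp` and `session_length` helpers and the sample timestamps are assumptions for illustration.

```ruby
require 'time'

# Matches timestamps such as 2012/03/13 17:30:15, ignoring any label around them.
TIMESTAMP = %r{\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}}

# Hypothetical helper: pulls the timestamp out of a timecard line.
def parse_stamp(text)
  Time.strptime(text[TIMESTAMP], '%Y/%m/%d %H:%M:%S')
end

# Hypothetical helper: elapsed time between two timecard lines as H:MM:SS.
def session_length(time_in, time_out)
  seconds = (parse_stamp(time_out) - parse_stamp(time_in)).to_i
  hours, remainder = seconds.divmod(3600)
  minutes, secs = remainder.divmod(60)
  format('%d:%02d:%02d', hours, minutes, secs)
end

puts session_length('2012/03/13 09:00:00', 'Out - 2012/03/13 17:30:15')
```

Subtracting one Time from another gives the gap in seconds, so the rest is just integer division.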

Now let’s talk about timeadd.rb. In essence, clock grabs the hours for each recorded session and pumps each one into timeadd.rb as an argument.

TOTAL=`grep "Session Length -" \$TIMECARD | cut -c23-30`
...
echo "Total Hours - `timeadd.rb \$TOTAL | cut -d . -f 1`" >> \$TIMECARD

First, timeadd.rb creates a base timestamp of the current day at 00:00:00 and a totalTimeInSeconds variable. The rest of that portion is explained in the comments of the code below.

require 'time' # needed for Time.parse

baseTime = Time.parse('00:00:00')
totalTimeInSeconds = 0.0

ARGV.each do |arg|
  # Parses the hours into a Time for today
  time = Time.parse(arg)

  # Calculates the time difference in seconds using the base time 00:00:00
  timeInSeconds = time - baseTime

  # Sums up the time in seconds
  totalTimeInSeconds += timeInSeconds
end

To calculate the total time worked for that period, I implemented a nice little gem called ChronicDuration. You can read more about it in a blog post by Everyday Rails. In short, ChronicDuration takes a time in seconds and converts it to a given format, which in this case is day:hours:minutes:seconds as indicated by :chrono. After that I pass it back to clock.

totalTime = ChronicDuration::output(totalTimeInSeconds, :format => :chrono)
puts "#{totalTime}"
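
If you would rather not install a gem, the summing can also be sketched in plain Ruby. This is an alternative, not what timeadd.rb does: the `to_seconds` and `total_hours` helpers are hypothetical, and the sketch lets hours run past 24 rather than rolling over into days.

```ruby
# Hypothetical helper: converts "HH:MM:SS" into a number of seconds.
def to_seconds(duration)
  hours, minutes, seconds = duration.split(':').map(&:to_i)
  hours * 3600 + minutes * 60 + seconds
end

# Hypothetical helper: sums session lengths and formats the total as H:MM:SS.
def total_hours(durations)
  total = durations.sum { |d| to_seconds(d) }
  hours, remainder = total.divmod(3600)
  minutes, seconds = remainder.divmod(60)
  format('%d:%02d:%02d', hours, minutes, seconds)
end

puts total_hours(%w[08:30:15 07:45:50])
```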

Finally, clock does some final formatting on the timecard as explained in the previous post (download the files there too).

While Bash can handle basic math by converting time into seconds, I chose to take the Ruby route to learn some Ruby and keep the code in clock a bit simpler. Arguably, abstracting to external Ruby scripts is just as complex. However, it did simplify much of the work and I didn’t have to do a lot of tedious math and formatting in Bash.

# Data Mining & Web Scraping

I’m taking a data mining course this year. It’s fascinating new stuff and very applicable with all this buzz about “big data” and “BI.” Our professor requires we do a research project. So I got together with two friends and we started brainstorming about what we wanted to mine. We talked about some of the things we’d like to do predictive analysis on, like stocks, bonds or other securities. We thought about March Madness predictions. Then we settled on what we named “The Last Minute Vacation.” We haven’t decided exactly what we want to predict yet, but we are gathering data on airline flights and prices. One idea was to try and predict the best time to buy tickets last minute since there is a certain window of opportunity when tickets are really cheap. If this can be consistently predicted then it could be cost efficient to wait ’til the last minute to buy tickets. We realize that much of this information is widely known; however, we thought it would be fun to analyze the data ourselves and see what other trends we might find.

We have since researched some basic web scraping methods and settled on a GUI-based Ruby gem called Selenium. Basically, you set up a web driver object in Ruby and point it to a browser client like Firefox or Chrome and then tell it to get a website. We settled on Kayak because it makes things really simple.

First off, Kayak links are easy to hack. Actually, you don’t even have to hack them. They just spell it out for you. For example, if you want to fly from Salt Lake City to San Francisco on March 13, 2012 and return March 16, 2012, you can type in the following link: http://www.kayak.com/#/flights/SLC-SFO/2012-03-13/2012-03-16. If you don’t want to do a round-trip flight, just lose the last date. If you want a different airport, change the 3 letter airport code. That simple.
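
Building those links in code is just string formatting. Here’s a small sketch; the `kayak_url` helper is a hypothetical name, and it assumes the link format stays exactly as shown in the example above.

```ruby
require 'date'

# Hypothetical helper: builds a Kayak flight-search link in the format
# described above. Omit return_date for a one-way search.
def kayak_url(from, to, depart_date, return_date = nil)
  parts = ["#{from}-#{to}", depart_date.strftime('%Y-%m-%d')]
  parts << return_date.strftime('%Y-%m-%d') if return_date
  "http://www.kayak.com/#/flights/#{parts.join('/')}"
end

puts kayak_url('SLC', 'SFO', Date.new(2012, 3, 13), Date.new(2012, 3, 16))
```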

To make things even easier, Kayak actually lists a link to download a CSV file of all the flights available for your search query. We spent about 10 minutes trying to figure out how to best scrape the necessary data from each query and then stumbled upon this jewel. As you can imagine, this significantly simplified our task. Thank you Kayak.

The tricky part has been making sure Kayak doesn’t block us out. To try and tackle this we’ve added random wait times. We’ve also run into errors parsing the site for the CSV. Sometimes the site loads too slowly or the link just isn’t there for whatever reason. Most of this was easily handled with Ruby begin-rescue clauses. Now we are just trying to automate the scraping from a server so we don’t have to manually run it. This should be easy enough, except that the server we are going to use is not GUI-based, so that’s a bit of a problem.
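
The general shape of a begin-rescue block combined with random waits looks something like the sketch below. It is not our actual script: the `with_retries` helper, its defaults and the fake failure are all assumptions for illustration.

```ruby
# Hypothetical helper: runs the block, retrying on errors with a random pause.
# The sleeper is injectable so the pause can be skipped while testing.
def with_retries(attempts: 3, wait: 2.0..5.0, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError => e
    raise if tries >= attempts
    warn "Attempt #{tries} failed (#{e.message}); retrying"
    sleeper.call(rand(wait))
    retry
  end
end

# Stand-in for a page load that fails twice before the CSV link shows up.
calls = 0
result = with_retries(sleeper: ->(_s) {}) do
  calls += 1
  raise 'CSV link not found' if calls < 3
  'flights.csv'
end
puts result
```

In real use you would leave out the `sleeper:` argument so the script actually pauses between attempts.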

The script presently scrapes about 150 CSV files per run. We grab flight data for 30 consecutive days from the current date and then grab every 7th flight from there up to about 6 months out. Once we finish the scraping and run some analysis I’ll post some of our findings if anything worthwhile shows up.
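
A list of departure dates along those lines can be generated with Ruby’s Date class. This sketch reads the schedule as 30 consecutive days followed by every 7th day, which is an interpretation; the `search_dates` helper and its parameters are hypothetical.

```ruby
require 'date'

# Hypothetical helper: the next `daily` consecutive days, then every
# `step`-th day after that until roughly `months` months out.
def search_dates(start, daily: 30, step: 7, months: 6)
  dates = (0...daily).map { |offset| start + offset }
  horizon = start >> months
  day = dates.last + step
  while day <= horizon
    dates << day
    day += step
  end
  dates
end

dates = search_dates(Date.new(2012, 3, 1))
puts dates.length
```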

# Clock In, Clock Out

I’ve worked (and presently work) a number of jobs where I have to keep track of my own hours. As with any repetitive task, the answer is to automate. Sure, tracking your hours isn’t hard or time consuming, but having your computer do it for you while learning a little scripting appeals to me more. Thus, I concocted a nice little Bash shell script with a pinch of Ruby for the heavy lifting.

The script/program is simple. It’s called clock. All you do is type clock into the Terminal (Mac) and… voilà! You’re clocked in. When you want to clock out, type clock again and you’re clocked out. Each time you clock in or out, the program records a timestamp with the date and time. After each clock out, it records the number of hours worked during that session. When it’s time to submit your timecard so you can receive your paycheck, simply type clock -f to finalize your timecard. This will sum your session hours and rename the timecard with the start and end date so you know the period.

I’ve included a zip archive of the files (clock, timeadd.rb, timediff.rb) so if you find yourself in the same boat you can give the program a shot. I suggest you add them to your own “bin” folder in your home directory like I’ve done. If you aren’t familiar with this concept, let me give a short explanation.

Dedicating a directory in your home folder is convenient if you like to write various little programs/scripts. It allows you to…segregate these from the built in programs/scripts that come with your OS so they are easily accessible and you remember which ones are yours. To make these programs work just like the rest of the programs (e.g. ls, rm, less, pwd, etc.) add the directory to your \$PATH as follows:

export PATH="\$PATH:/Users/USERNAME/bin"

Of course, be sure to replace USERNAME with the obvious. You can type this command into the CLI or you can just add it to your .profile file in your home directory. If you don’t have a .profile file, go ahead and create one. Every time you open Terminal (or another CLI), the .profile file is read and executed, thus you will never have to type in those commands again manually.

Two more things to make this work:

1. Make sure to open clock in a text editor and change the path where the timecard-current.txt will be saved (in addition to any other paths).
2. Make sure to install the appropriate gems for the Ruby scripts. Simply type the following into Terminal:
gem install chronic_duration time_diff

If you have any thoughts, suggestions or questions, please comment. I value feedback.