Avoiding Bias in Test Data

Before an application launches publicly, it usually undergoes significant testing. This often requires dummy data, especially when the real data cannot be accessed, would be too large to transfer, or is too sensitive to use in testing.

When it came time to start testing the capabilities of the synchronization service we had built during my internship, I was assigned to build the script that would create the dummy data. Before I wrote the script, my tech lead asked me if I had ever heard of index bias. I wasn’t quite sure what he was talking about, so I replied that I hadn’t. He then asked me if I had heard of Benford’s Law, to which I replied “Yes,” thanks to Dr. Albrecht.

Josh, my tech lead, told me our test data should be as natural and realistic as possible. He explained that it should follow a normal distribution, or bell curve, as opposed to the curve of Benford’s Law, which is skewed right and not so normal. To accomplish this goal, we would need to avoid bias in our test data.

Avoiding bias in test data can be easy if you know how and do it deliberately. However, humans naturally like structured, contrived data, so if you don’t already know how to avoid bias, you will probably produce biased data.

According to Josh, index bias consists of at least three categories:

1. Order Bias
2. Subset Bias
3. Dimensional Bias

Order Bias

Order bias occurs when test data is ordered and structured. In my opinion, this is the bias humans are most often guilty of. It might look something like this:

Test1, Test2, Test3, Test4, …

Or like this:

Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua, etc.

In real-world data, names (or objects, dates, numbers, etc.) rarely come into the system in order. The ordering happens after insertion.

In my situation, I needed to create a couple of thousand Exchange mailboxes using realistic, normal names. So I gathered 10,000 random names off the Internet and then randomized their order programmatically.
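Randomizing the order can be as simple as shuffling the list. Here is a minimal sketch in Python, not the script I actually wrote. The function name `shuffle_names` and the optional `seed` parameter are made up for illustration; the seed is only there so a test run can be reproduced.

```python
import random


def shuffle_names(names, seed=None):
    """Return a shuffled copy of names, leaving the input list untouched."""
    rng = random.Random(seed)  # seed is optional; pass one for repeatable runs
    shuffled = list(names)     # copy so the caller's list keeps its order
    rng.shuffle(shuffled)      # in-place Fisher-Yates shuffle of the copy
    return shuffled


# Example input taken from the alphabetized list above.
ordered = ["Jethro", "John", "Jon", "Jonathon", "Jordan", "Josh", "Joshua"]
shuffled = shuffle_names(ordered, seed=42)
print(shuffled)
```

The names come out in an arbitrary order, but every name is still present exactly once.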

Subset Bias

Subset bias occurs when data is pulled from the same subset continually.

For example, if I were to pull the same two names (e.g. Josh and Joshua) from my list of names (e.g. Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua) every time I needed a subset of names, my test data would be characterized by subset bias. This is unrealistic because data rarely comes from the same subsets in the real world.

To avoid subset bias in my script, I randomly pulled a subset of the desired size from my original list of 10,000 names. Furthermore, the subset I grabbed did not use consecutive values. In other words, I did not grab 5,000 names by starting at a random name and taking the 5,000 names that came right after (or before) it.
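A sketch of that idea, assuming Python: `random.sample` picks items from anywhere in the list without repeats, so the result is not a consecutive slice. The helper name `pick_subset` and the placeholder pool of `name0`, `name1`, ... are hypothetical stand-ins for the real list of 10,000 names.

```python
import random


def pick_subset(names, count, seed=None):
    """Pick `count` distinct names from anywhere in the list, in random order."""
    rng = random.Random(seed)
    return rng.sample(names, count)  # sampling without replacement


# Placeholder pool standing in for the 10,000 gathered names.
pool = [f"name{i}" for i in range(10000)]
subset = pick_subset(pool, 5000, seed=7)
print(subset[:5])
```

Calling it again without a seed gives a different subset each time, which is the point.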

Dimensional Bias

Dimensional bias occurs when the test data is all the same (or nearly the same) length, size, etc. For example:

John, Mike, Dave, Carl, etc.

These names are all noticeably four characters long.

To avoid dimensional bias in the names I gathered, I checked for an equal ratio of male to female names (1:1), as suggested by trusty Internet resources. I also checked the variance in name length: the names ranged anywhere from 2 to 14 characters, with the average somewhere in between.
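The length check is easy to sketch in Python. This only covers name length, not the male-to-female ratio, and the function name `length_summary` and the `varied` sample list are my own illustrations rather than anything from the original script.

```python
from statistics import mean, pstdev


def length_summary(names):
    """Summarize how much the name lengths vary."""
    lengths = [len(name) for name in names]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "stdev": pstdev(lengths),  # population standard deviation of the lengths
    }


biased = ["John", "Mike", "Dave", "Carl"]            # every name is 4 characters
varied = ["Jo", "Mike", "Jonathon", "Bartholomew"]   # illustrative mix of lengths

print(length_summary(biased))
print(length_summary(varied))
```

A standard deviation at or near zero, like the first list produces, is a quick sign of dimensional bias.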

Final Thoughts

For simpler applications and “smaller” deals, the degree of testing will vary depending on how much is at stake, and the effort required to avoid index bias may be unnecessary.

I recognize that I could have gone to considerably greater lengths to guard against index bias; however, I did what was deemed necessary for our circumstances.

App of the Day: Console

I currently work as a Quality Engineer at Palantir Technologies, where I do a lot of feature and product testing. As a result, one of the tools I use most often is the POSIX tool tail.

Linux/Unix/Mac users may be familiar with tail. In layman’s terms, tail lets you grab a number of lines from the end of a text-based file. In testing, we use tail -f <filename> a lot because the “f” option immediately and automatically updates the CLI with the most recently written lines of the file you are tailing. Testers love this because they like to see stack traces printed on their screen the moment something errors in a program or system. Many development environments (think Eclipse, NetBeans, etc.) have consoles built in for errors, system printing, and logging. Java-based programs also have the option of automatically opening the Java console when you run them.
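If you’re curious what “following” a file involves, here is a rough Python sketch of the idea. It is not how tail itself is implemented. The function name `follow` and its `poll_seconds` and `max_idle_polls` parameters are invented for this example, and the sketch assumes each line is written to the file in one piece.

```python
import os
import time


def follow(path, poll_seconds=0.5, max_idle_polls=None):
    """Yield lines appended to `path` from now on, roughly like `tail -f`.

    With max_idle_polls=None this keeps polling forever; a number makes it
    stop after that many consecutive polls that found nothing new.
    """
    handle = open(path, "r")
    handle.seek(0, os.SEEK_END)  # skip everything already in the file

    def lines():
        idle = 0
        with handle:  # close the file when the generator finishes
            while max_idle_polls is None or idle < max_idle_polls:
                line = handle.readline()
                if line:
                    idle = 0
                    yield line.rstrip("\n")
                else:
                    idle += 1
                    time.sleep(poll_seconds)  # nothing new yet; wait and retry

    return lines()


# Hypothetical usage (the file name is made up):
# for line in follow("app.log"):
#     print(line)
```

The real tail handles far more (log rotation, partial lines, multiple files), so treat this as a picture of the concept only.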

On my Windows box at work, I use Cygwin to run the tail command. Cygwin is a Linux-like environment for Windows that allows users to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows. On my Mac, I just use Terminal. More recently, however, I’ve discovered an even greater tool called Console. Console comes as a pre-installed utility with the Mac OS X operating system. I’ve found that the Utilities folder is full of great (whadoyaknow!) utility apps, and I suggest you take a gander through that folder if you haven’t already. I’ve used the Grapher app a couple of times in my ECON 110 class this semester, back when I used to take notes on my computer. (I’ve since switched to paper, since we do more graphing than anything else. Which reminds me of a great note-taking app for iPad called Notes Plus. Alas, I digress. I will save that discussion for another post.)

The reason I love Console most is that you can tell it to bounce in the dock when stack traces print to it. You don’t have to keep it open on a second monitor so it’s always visible or, worse yet, leave it peeking out at the side of the screen behind the program you are testing. Better yet, you can choose to have it come to the forefront for a limited amount of time (say 5 seconds) and then disappear into the background again. I’ve searched for something like this on Windows and haven’t seen anything like it. That’s probably because there are IDEs and such, but regardless, it’s a beaut. It has other great functions for console-type stuff too. Check it out if you’re a tester or programmer. It might be pretty handy.