Avoiding Bias in Test Data

Before an application launches publicly, it usually undergoes significant testing. This often requires the use of dummy data, especially when the real data cannot be accessed, would be too large to transfer, or is too sensitive to be used in testing.

When it came time to start testing the capabilities of the synchronization service we had built during my internship, I was assigned to build the script that would create the dummy data. Before I wrote the script, my tech lead asked me if I had ever heard of index bias. I wasn’t quite sure what he was talking about, so I replied that I hadn’t. He then asked me if I had heard of Benford’s Law, to which I replied “Yes,” thanks to Dr. Albrecht.

Josh, my tech lead, told me our test data should be as natural and realistic as possible. He explained that it should follow a normal distribution, or bell curve, as opposed to the right-skewed curve of Benford’s Law. To accomplish this goal, we would need to avoid bias in our test data.

Avoiding bias in test data is easy if you know how and do it deliberately. However, humans naturally gravitate toward structured, contrived data, so if you don’t already know how to avoid bias, you will probably produce biased data.

According to Josh, index bias consists of at least three categories:

  1. Order Bias
  2. Subset Bias
  3. Dimensional Bias

Order Bias

Order bias occurs when test data is ordered and structured. In my opinion, this is the bias humans are most often guilty of. It might look something like this:

Test1, Test2, Test3, Test4, …

Or like this:

Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua, etc.

In real-world data, names (or objects, dates, numbers, etc.) rarely come into the system in order. Any ordering happens after insertion.

In my situation, I needed to create a couple thousand Exchange mailboxes using realistic, normal names. Thus, I gathered 10,000 random names off the Internet and proceeded to randomize those names programmatically.
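A minimal Python sketch of this kind of randomization might look like the following. It is not the original script; the function name shuffled_names and the seed parameter are assumptions made for illustration.

```python
import random

def shuffled_names(names, seed=None):
    """Return a new list with the names in random order.

    The input list is left untouched. Passing a seed (hypothetical
    parameter) makes a test run reproducible.
    """
    rng = random.Random(seed)
    result = list(names)   # copy so the caller's list keeps its order
    rng.shuffle(result)    # shuffle the copy in place
    return result

# The alphabetized list from the order bias example above.
names = ["Jethro", "John", "Jon", "Jonathon", "Jordan", "Josh", "Joshua"]
print(shuffled_names(names, seed=42))
```

Shuffling a copy keeps the original list available, which helps if the same source list feeds several test runs.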

Subset Bias

Subset bias occurs when data is pulled from the same subset continually.

For example, if I were to pull two names (e.g. Josh and Joshua) from my list of names (e.g. Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua) every time I needed a subset of names, my test data would be characterized by subset bias. This is unrealistic because data rarely come from the same subsets in the real world.

To avoid subset bias in my script, I randomly pulled a subset of the desired size from my original list of 10,000 names. Furthermore, the subset did not consist of consecutive values. In other words, I did not pick a random starting name and then grab the 5,000 names immediately after (or before) it.
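One way to sketch this kind of non-consecutive sampling in Python is with random.sample, which draws without replacement from anywhere in the list. The function name random_subset, the seed parameter, and the placeholder pool below are assumptions for illustration, not the original script.

```python
import random

def random_subset(names, size, seed=None):
    """Pick `size` distinct entries from anywhere in `names`.

    random.sample draws without replacement, so the result is not a
    consecutive slice of the list. It raises ValueError if `size` is
    larger than the list.
    """
    rng = random.Random(seed)
    return rng.sample(names, size)

# Placeholder pool standing in for the real list of 10,000 names.
pool = [f"name{i}" for i in range(10000)]
subset = random_subset(pool, 5000, seed=7)
print(len(subset), subset[:5])
```

Contrast this with a slice such as pool[start:start + 5000], which would always return neighbors from the source list.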

Dimensional Bias

Dimensional bias occurs when all of the test data has the same or similar length, size, etc. For example:

John, Mike, Dave, Carl, etc.

These names are all noticeably four characters long.

To avoid dimensional bias in the names I gathered, I checked for an equal (1:1) ratio of male to female names, as suggested by trusty Internet resources. I also checked for variance in name length: anything from 2 to 14 characters, with an average somewhere in between.
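A length check like this can be sketched in a few lines of Python. The helper name length_summary is hypothetical, and the sample lists are made up for illustration.

```python
import statistics

def length_summary(names):
    """Summarize how much the name lengths vary."""
    lengths = [len(name) for name in names]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        # Population standard deviation; 0 means every name has the same length.
        "stdev": statistics.pstdev(lengths),
    }

# All four-character names: the standard deviation is zero.
print(length_summary(["John", "Mike", "Dave", "Carl"]))

# A more varied list spreads the lengths out.
print(length_summary(["Al", "Josh", "Jonathon", "Christopher"]))
```

A standard deviation at or near zero is a quick signal that the data is dimensionally biased.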

Final Thoughts

For simpler applications and “smaller” deals, the degree of testing will vary depending on how much is at stake, and the effort required to avoid index bias may be unnecessary.

I recognize that I could have gone to considerably greater lengths to guard against index bias; however, I did what was deemed necessary for our circumstances.

Benford’s Law

In the first year of the Information Systems program in the Marriott School of Management (BYU), students work on a programming project titled “Benford’s Law” in the Enterprise Programming class. It’s a really practical application of Benford’s Law and specific programming concepts, all wrapped into one project.

For those of you unfamiliar with Benford’s Law, the basic premise is this: the leading digit of the numbers in a list or dataset has a specific probability of occurring depending on what that digit is (1-9, since 0 adds no value as a leading digit). This leading-digit probability follows a logarithmic distribution. Thus, the number 1 has about a 30% chance of being the leading digit, the number 2 approximately 17.6%, the number 3 roughly 12.5%, and so on. This doesn’t seem to make sense initially, since you’d think each of the nine digits would have an equal chance (about 11%) of being the leading digit. However, real-world data has shown otherwise.
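The percentages above come from the standard Benford formula, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The short Python sketch below computes those expected values and counts leading digits in a list of nonzero integers. The function names are hypothetical, and this is not the class project or the Picalo module.

```python
import math
from collections import Counter

def benford_expected():
    """Expected probability of each leading digit under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_frequencies(values):
    """Observed share of each leading digit in a list of nonzero integers."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Print the expected share for each digit as a percentage.
for digit, p in benford_expected().items():
    print(digit, f"{p:.1%}")

# Powers of 2 are a classic sequence that closely tracks the expected curve.
observed = leading_digit_frequencies([2 ** n for n in range(1000)])
print(f"{observed[1]:.1%}")
```

Comparing the observed frequencies against the expected ones is the basic idea behind using Benford’s Law to flag suspicious datasets.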

[Figure: Benford’s Law logarithmic distribution]

I ran across a sweet website in my Internet travels: Testing Benford’s Law. It takes a few real-world datasets and applies Benford’s Law to them. Some examples include Stack Overflow user reputation, the most common iPhone passcodes, and file sizes in the Linux source tree. Other articles across the web give more examples: “Volcanic eruptions follow Benford’s Law” and “Fraudsters obey Benford’s Law.”

Another example worth checking out is an application called Picalo (GNU). Picalo is a fraud-detection application developed in Python by Dr. Albrecht, a professor at the Marriott School of Management and the professor who assigns the Benford’s Law project. Dr. Albrecht has included a module in Picalo that specifically uses Benford’s Law to analyze data and aid in fraud detection. You can read more about Picalo and check out the picalo.Benfords module.

You can read a more detailed description of Benford’s Law on Wikipedia.