Avoiding Bias in Test Data

Before an application launches publicly, it usually undergoes significant testing. This often requires dummy data, especially when the real data cannot be accessed, is too large to transfer, or is too sensitive to use in testing.

When it came time to start testing the capabilities of the synchronization service we had built during my internship, I was assigned to build the script that would create the dummy data. Before I wrote the script, my tech lead asked me whether I had ever heard of index bias. I wasn’t quite sure what he was talking about, so I replied that I hadn’t. He then asked me whether I had heard of Benford’s Law, to which I replied “Yes,” thanks to Dr. Albrecht.

Josh, my tech lead, told me our test data should be as natural and realistic as possible. He explained that it should follow a normal distribution (a bell curve), as opposed to the right-skewed curve of Benford’s Law. To accomplish this goal, we would need to avoid bias in the test data.

Avoiding bias in test data is easy if you know how and do it deliberately. However, humans naturally gravitate toward structured, contrived data, so if you don’t already know how to avoid bias, you will probably produce biased data.

According to Josh, index bias consists of at least three categories:

  1. Order Bias
  2. Subset Bias
  3. Dimensional Bias

Order Bias

Order bias occurs when test data is ordered and structured. In my opinion, this is the bias humans are most often guilty of. It might look something like this:

Test1, Test2, Test3, Test4, …

Or like this:

Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua, etc.

In real-world data, names (or objects, dates, numbers, etc.) rarely come into the system in order. Any ordering is applied after insertion.

In my situation, I needed to create a couple of thousand Exchange mailboxes using realistic, normal names. So I gathered 10,000 names off the Internet and shuffled them programmatically, so that they would not enter the system in any recognizable order.
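As a rough illustration, the shuffle step could look something like this in Python. This is only a sketch, not the script used in the project; the function name `shuffle_names` and the optional `seed` parameter are assumptions made for the example, and the short list stands in for the full 10,000 names.

```python
import random

def shuffle_names(names, seed=None):
    """Return a new list with the names in random order.

    The input list is left untouched. `seed` is optional (hypothetical
    parameter): pass a value to make a test run reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(names)   # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)    # in-place Fisher-Yates shuffle on the copy
    return shuffled

# Alphabetized input, i.e. exactly the kind of order bias described above.
names = ["Jethro", "John", "Jon", "Jonathon", "Jordan", "Josh", "Joshua"]
shuffled = shuffle_names(names, seed=42)
print(shuffled)
```

Shuffling a copy rather than the original keeps the source list available for other steps, such as drawing subsets.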

Subset Bias

Subset bias occurs when data is pulled from the same subset continually.

For example, if I were to pull two names (e.g. Josh and Joshua) from my list of names (e.g. Jethro, John, Jon, Jonathon, Jordan, Josh, Joshua) every time I needed a subset of names, my test data would be characterized by subset bias. This is unrealistic because data rarely come from the same subsets in the real world.

To avoid subset bias in my script, I randomly pulled a subset of the desired size from my original list of 10,000 names. Furthermore, the subset did not consist of consecutive values. In other words, I did not pick a random starting name and then take the 5,000 names that came after (or before) it.
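One way to draw such a subset in Python is `random.sample`, which picks each element independently of its position instead of slicing out a contiguous run. The sketch below is an assumption about how this could be done, not the original script; `pick_subset` and its parameters are hypothetical names.

```python
import random

def pick_subset(names, count, seed=None):
    """Return `count` distinct names chosen at random positions.

    random.sample draws without replacement, so no name repeats and the
    picks are not a consecutive block of the source list.
    """
    rng = random.Random(seed)
    return rng.sample(names, count)

names = ["Jethro", "John", "Jon", "Jonathon", "Jordan", "Josh", "Joshua"]

# Biased approach: always the same slice of the list.
biased_subset = names[5:7]          # always ["Josh", "Joshua"]

# Less biased approach: a fresh random draw each time a subset is needed.
subset = pick_subset(names, 3)
print(subset)
```

Calling `pick_subset` again without a seed gives a new draw, so repeated test runs are not all fed from the same slice of the list.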

Dimensional Bias

Dimensional bias occurs when the test data is all the same or a similar length, size, etc. For example:

John, Mike, Dave, Carl, etc.

These names are all exactly four characters long.

To avoid dimensional bias in the names I gathered, I checked for an equal ratio of male to female names (1:1), as suggested by trusty Internet resources. I also checked for variance in name length: anything from 2 to 14 characters, with the average falling somewhere in between.
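A length check like this is easy to automate. The following Python sketch summarizes the lengths in a list of names so that a lack of variety stands out; the function name `length_summary` and the fields it reports are assumptions chosen for illustration, not part of the original script.

```python
import statistics
from collections import Counter

def length_summary(names):
    """Summarize name lengths: range, mean, spread, and a histogram."""
    lengths = [len(name) for name in names]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),  # population standard deviation
        "histogram": dict(sorted(Counter(lengths).items())),
    }

biased = ["John", "Mike", "Dave", "Carl"]                 # all four characters
varied = ["Jon", "Jethro", "Jonathon", "Josh", "Joshua"]  # three to eight characters

print(length_summary(biased))
print(length_summary(varied))
```

A standard deviation of zero, as with the four-character names, is a clear sign of dimensional bias. For a real list, the minimum, maximum, and histogram show whether the lengths cover the range you expect.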

Final Thoughts

How much testing an application needs depends on how much is at stake. For simpler applications and “smaller” deals, the effort required to avoid index bias may be unnecessary.

I recognize that I could have gone to considerably greater lengths to guard against index bias; however, I did what was deemed necessary for our circumstances.

Comments, questions and feedback welcome.