The trickiest bug I faced this year and what it taught me

A quick story about a seemingly simple bug that drove me nuts?  (Though, don’t they all?)  The story begins with a sorting bug in Java.

Comparable is the Java interface that lets you compare two objects.  The implementation must return 0 for equivalent objects, negative if first object is “less than” second object, and positive if first is “greater than” the second.  The comparison must be symmetric (a==b means b==a) and transitive (a<b and b<c implies a<c).  If you violate this principle, Java gives you a helpful error message at runtime: “Comparison method violates its general contract!”

I had a complex Comparable object, using a multi-level sort on four variables, which violated the contract.  For a test-driven developer, this should be easy enough to address.  I glanced at my Comparable implementation and found the variables I suspected to cause the issue.  I wrote JUnit tests creating lists of a handful of objects, in different combinations, trying to force the bug.  I wrote a dozen tests and none had failed yet.  I started to get desperate.

I was desperate enough to include a random number generator in my test case. (Oh the shame!)  I created a 20,000 iteration for loop, creating objects with random values in each of the fields.  I should have only needed 4*15*4*3 objects (number of valid values for each variable), but at this point I was taking no chances.  Finally, the test case failed.  Armed with a failing test, I went to the Comparable implementation and made a patch.  My entire test suite passed and I committed both the patch and the test.

Later that night, the test really bugged me.  How could I as a self-serious developer commit a test with random values in it – and 20,000 random values at that!  I disabled my patch and I serialized the random values to disk.  I worked on reducing the list to the minimal set that could reproduce the bug.  I was expecting to get down to 3 objects, or maybe 5-6 if the transitivity was really gnarly.

Using binary search I quickly whittled the list to 10,000 and 5,000 and 1,000.  At 1,000 items I imported the list to Excel and sorted them, now my removals would not be random.  I went from 1,000 to 500 to 250 to 150.  At this point my list was just 150 clones of 9 distinct variable combinations, and the first two variables were constant in all of the clones.  The first six items in my list are below:

[0,1,0,0],[0,1,0,0],[0,1,0,1],[0,1,0,1],[0,1,0,1],[0,1,0,2],….

Going from 150 to 75, suddenly the bug disappeared!  The strange part was, I had not removed any unique variable combinations, just groups of clones.  I started manually deleting 5 random lines at a time, testing each time.  Under 100 list items, it was still hit or miss whether the test would fail.  I was still only removing clones.

After two hours, I was down to a failing test with 40 list items, still from the same clone set.  I deleted lines randomly one at a time, getting down to 32 list items.  I could not get the test to fail with 31 list items, after trying all 32 combinations!

The good news was, I had a minimal test case which caused the sort function to fail.  The bad news was I still had 32 items in my test case.  And a bit of an existential crisis on my hands.  If my Comparator worked on a 31 item list, and failed on a 32 item list that was for all intents and purposes the same list (remember both lists have the same 9 values cloned in them), then is my Comparator really failing the Java Comparator contract or not?

I suspect what happened here is some irregularity in Java’s merge sort.  32 is a suspicious number, a perfect power of two.  Perhaps the merge sort could not find an appropriate pivot when it had two choices where to split the list (in 0-based indices, index 15 or 16) due to the implementation of my Comparator, and this did not happen in the 31 item list (with perfect middle of index 15).  In the end, I never figured it out.  I restored my patch and committed a new test with “only” 32 items.

I cannot tell if the root of all this was a bug in my comparator, Java’s merge sort, or the part of Java’s merge sort where it tries to tell if your Comparator is legal.  I may never know.  If it was really a Java bug, then this is a once in a lifetime career moment.

In any case this experience fills me with humility.  The software that we use every day, the hardware it runs on, the electrical grid powering them, all of these are complex systems and powerful abstractions built one on top of the next on top of the next.  It is truly amazing that anything works, and let’s be thankful that it does!

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.