Suite smells: undertesting and overtesting
It is tempting to test that the code does what it should do and leave it at that, but it’s arguably even more important to test that it doesn’t do what it shouldn’t do.
—Mike Bland, “Goto Fail, Heartbleed, and Unit Testing Culture”
The Power of Go: Tests is a compilation of all the lessons I’ve learned over a lifetime of software engineering: it’s about how to write robust, reliable, and above all correct programs.
This is the second in a three-part series of excerpts from the book, focusing on test suites in Go, what kind of problems you might find with them, and some tips on how to fix them.
- Testing legacy code
- Undertesting and overtesting
- Slow, flaky, and failing (coming soon)
In the first post, we talked about what to do if your codebase has no tests at all, or not enough tests. Today, let’s look at the tests you have, and what might be wrong with them.
Even when a certain piece of code is nominally covered by some test, is it really tested? That’s not always a given. In other words, just because a test causes some function to be executed, that doesn’t mean it shows that the function does the right thing. For that we need to look more closely at the test. Is it too optimistic?
Optimistic tests
An optimistic test is designed only to confirm that the system works, not to prove that it doesn’t.
Surprisingly, one problem that optimistic tests can miss is when the system does nothing at all.
For example, consider a test for some function user.Create, like this:
func TestCreateCreatesGivenUser(t *testing.T) {
    t.Parallel()
    user.Create("Alice")
    if !user.Exists("Alice") {
        t.Error("Alice not created")
    }
}
At first glance, this is plausible. We create a test user, Alice, and check if she subsequently exists. If not, the test fails. That part is fine. So, can you see what’s missing? If not, you might like to take a minute to think about it before reading on.
It’s always worth asking of any test whether it rigorously checks its preconditions as well as its postconditions. The developer’s focus, naturally enough, tends to be on the state of the world after the operation under test, but that can result in some risky assumptions about its prior state.
Suppose Create
actually does nothing. How could this
test pass? Only when Alice already exists. And would we know if that was
the case?
Our test checks that Alice exists after calling
Create
, but what’s missing is a check that she doesn’t
exist beforehand.
In other words, suppose we don’t clean up the database after each
test run, so if Alice was ever created, she’ll still be there.
And suppose someone later introduces a bug into Create
that
prevents it from actually creating users.
To put it another way, here’s an obviously incorrect implementation that nevertheless passes this test:
type User struct {
    Name string
}

var (
    m     = new(sync.Mutex)
    users = map[string]*User{
        "Alice": {
            Name: "Alice",
        },
    }
)

func Create(name string) {}

func Exists(name string) bool {
    m.Lock()
    defer m.Unlock()
    _, ok := users[name]
    return ok
}
We thought we were testing Create
, but we really aren’t,
because Create
does nothing at all, yet the test doesn’t
detect that. Alice always exists, so the test always passes.
This kind of mirage test is especially dangerous, because it
looks like you have a test, but you don’t.
It turns out that Create
needs not just to leave the
world in a state where Alice exists. What’s important about
Create
is that it changes the world from a state
where Alice doesn’t exist to one where she does.
You might think that goes without saying, but we’ve just proved that it doesn’t. We need to pay attention to preconditions as well as postconditions, according to the contract that the system under test is supposed to fulfil.
Let’s write a test that would catch this bug, then:
func TestCreateCreatesGivenUser(t *testing.T) {
    t.Parallel()
    if user.Exists("Alice") {
        t.Fatal("Alice unexpectedly exists")
    }
    user.Create("Alice")
    if !user.Exists("Alice") {
        t.Error("Alice not created")
    }
}
The difference is very simple, but important: we check our preconditions.
What about this test, then?
func TestDeleteDeletesGivenUser(t *testing.T) {
    t.Parallel()
    user.Create("Alice")
    user.Delete("Alice")
    if user.Exists("Alice") {
        t.Error("Alice still exists after delete")
    }
}
Again, this looks reasonable on a cursory inspection. It creates
Alice, deletes her (sorry, Alice), and then ensures that she no longer
exists. What could be wrong with Delete
that this test
wouldn’t catch?
Well, what if both Create
and
Delete
do nothing at all? That seems like a pretty major
bug, yet this test doesn’t detect it. There are no preconditions, so the
outcome of the test is the same whether Create
and
Delete
actually have any effect or not. The test isn’t
wrong, as far as it goes: it just doesn’t go far enough.
There’s a pretty big loophole in it.
This kind of bug isn’t as unlikely as you might think, either. I’ve
made this exact mistake in the past: I stubbed out Create
and Delete
methods with placeholders, then forgot that I
hadn’t finished them, because the test was passing. It’s easy to do.
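To picture it, here's roughly what those unfinished placeholders might have looked like. With both stubs empty, and no Alice in the store, the test above passes anyway:

func Create(name string) {} // TODO: implement
func Delete(name string) {} // TODO: implement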
What we’re missing here, in fact, is another precondition: that the user does exist before we try to delete them.
func TestDeleteDeletesGivenUser(t *testing.T) {
    t.Parallel()
    user.Create("Alice")
    if !user.Exists("Alice") {
        t.Error("Alice not created")
    }
    user.Delete("Alice")
    if user.Exists("Alice") {
        t.Error("Alice still exists after delete")
    }
}
If Create
doesn’t do anything, this test will fail at
the first check. If Delete
doesn’t do anything, it’ll fail
at the second.
In any non-trivial codebase, you’re pretty much guaranteed to find at least a few tests that are optimistically feeble in this way. Look for any test that doesn’t properly establish its preconditions, and fix it. This will add a lot of value to the test suite overall.
Another example of this kind of problem is when the test fails to
check some important but implicit postconditions. For example,
in TestDelete, the explicit postcondition is that
Alice shouldn't exist after deletion. So what are we implicitly missing?
What else could a reasonable person ask for from a Delete
function?
As usual, a productive way to answer that is to think about possible
bugs in Delete
. Suppose, for example, that
Delete
mistakenly deletes not only Alice, but all
users in the database. That kind of thing is surprisingly easy to do,
especially with SQL queries (omitting a WHERE
clause, for
example).
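To make that concrete, here's a sketch of a SQL-backed Delete with exactly that bug (the users table and the db handle are hypothetical, invented for illustration):

// Buggy: without a WHERE clause, this deletes every user,
// yet TestDeleteDeletesGivenUser above still passes.
func Delete(name string) error {
    _, err := db.Exec("DELETE FROM users")
    // Correct version:
    // _, err := db.Exec("DELETE FROM users WHERE name = ?", name)
    return err
}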
If calling Delete
on a single user instead nukes the
whole database, that’s a pretty major bug, wouldn’t you say? This test
doesn’t detect it, because it focuses only on what should
happen, and ignores what shouldn’t.
How could we detect such a bug, then? Quite easily, it turns out.
Here’s what we do. We create two users in the test, but delete only one of them. Then we check that the one we deleted doesn’t exist, and the one we didn’t delete still exists:
func TestDeleteDeletesGivenUserOnly(t *testing.T) {
    t.Parallel()
    user.Create("Alice")
    if !user.Exists("Alice") {
        t.Error("Alice not created")
    }
    user.Create("Bob")
    if !user.Exists("Bob") {
        t.Error("Bob not created")
    }
    user.Delete("Alice")
    if user.Exists("Alice") {
        t.Error("Alice still exists after delete")
    }
    if !user.Exists("Bob") {
        t.Error("Bob was unexpectedly deleted")
    }
}
This test has accumulated a bit of paperwork, so let’s refactor that out into a helper function:
func TestDeleteDeletesGivenUserOnly(t *testing.T) {
    t.Parallel()
    createUserOrFail(t, "Alice")
    createUserOrFail(t, "Bob")
    user.Delete("Alice")
    if user.Exists("Alice") {
        t.Error("Alice still exists after delete")
    }
    if !user.Exists("Bob") {
        t.Error("Bob was unexpectedly deleted")
    }
}

func createUserOrFail(t *testing.T, name string) {
    t.Helper()
    user.Create(name)
    if !user.Exists(name) {
        t.Errorf("%s not created", name)
    }
}
Who would have thought that there was so much scope for things to go
wrong with a seemingly simple Delete
function? Well,
you would, or at least you will now.
If you find yourself, as a result, becoming thoroughly sceptical about the idea that anything works the way it’s supposed to, congratulations: you’re thinking like a tester.
Persnickety tests
Sometimes, though not often, people can take testing a bit too much to heart, and test more than strictly necessary. As we’ve seen, it’s much easier to err the other way, and leave out important things such as preconditions and implicit postconditions, like not deleting all the users in the database. But overtesting does afflict some test suites.
It’s important to keep tests focused on only the part of the system they’re supposed to care about, and on only the behaviour that matters. They should avoid checking for irrelevant things.
Beware also of simply comparing too much. Sometimes it can be convenient to compare a function’s entire result against an expected value, rather than individually checking each of its fields, but it’s not always a good idea.
Comparing the whole struct makes sense when all the fields are affected by the behaviour under test. But when only some fields are important to this test, checking irrelevant fields makes the test brittle, and obscures its real purpose. The same applies to checking entire strings or output files, when only certain parts of the data are actually important.
The easiest way to avoid brittle tests is to check only the properties you care about. Be selective in your assertions. Don’t check for exact string matches, for example, but look for relevant substrings that will remain unchanged as the program evolves.
—Alan Donovan & Brian Kernighan, “The Go Programming Language”
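For instance, if a test cares only about a user's email address after some update, it can assert just that field rather than comparing the whole struct. Here's a sketch (user.New, UpdateEmail, and the Email field are hypothetical, for illustration):

func TestUpdateEmailSetsGivenEmail(t *testing.T) {
    t.Parallel()
    u := user.New("Alice")
    u.UpdateEmail("alice@example.com")
    // Assert only the field this test is about; unrelated
    // fields can change without breaking the test.
    if u.Email != "alice@example.com" {
        t.Errorf("want alice@example.com, got %s", u.Email)
    }
}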
Watch out for tests that lazily compare output against a golden file, for example, when the behaviour they’re testing is only about a small subset of that file. Similarly, a test should not assert the exact value of an error, but only that there is some error, when there’s supposed to be.
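In Go terms, that might look something like this (a sketch; parseConfig is a hypothetical function that should fail on bad input):

func TestParseConfigRejectsInvalidInput(t *testing.T) {
    t.Parallel()
    err := parseConfig([]byte("not valid config"))
    // Brittle: breaks whenever the error message is reworded.
    // if err == nil || err.Error() != "invalid syntax at line 1" { ...
    // Robust: assert only that invalid input produces some error.
    if err == nil {
        t.Error("want error for invalid config, got nil")
    }
}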
Some exposure to the idea of property-based testing, as commonly used in fuzz testing, can also be helpful for constructing robust tests. For example, what’s important about a result is often not its exact value, but some property of the value, especially an invariant property.
I recently reviewed a program that needed to create a “fingerprint” of a piece of data, for deduplication purposes. In case the same data was submitted to the system later, the fingerprint would enable the system to recognise it without actually having to store all the data, which could be very large.
A cryptographic digest, or hash value, is an obvious way to do this, so the program had a test something like this:
func TestHashReturnsMD5HashOfGivenData(t *testing.T) {
    t.Parallel()
    data := []byte("These pretzels are making me thirsty.")
    want := md5.Sum(data)
    got := fingerprint.Hash(data)
    if want != got {
        t.Errorf("want %v, got %v", want, got)
    }
}
The implementation of Hash
doesn’t matter, but let’s
assume it’s something like this:
func Hash(data []byte) [md5.Size]byte {
    return md5.Sum(data)
}
Fine. But MD5 is insecure, so I suggested using a SHA-256 hash instead:
func Hash(data []byte) [sha256.Size]byte {
    return sha256.Sum256(data)
}
This broke the test, which makes no sense, because Hash
works. So what’s the problem?
Well, what are we really testing here? Not that Hash
produces an MD5 hash, specifically; that’s incidental. What
matters is that the same data should always hash to the same value,
whatever that value actually is. And, no less importantly, that
different data should hash to different values.
So we ended up with a test something like this instead:
func TestHashGivesSameUniqueHashForSameData(t *testing.T) {
    t.Parallel()
    data := []byte("These pretzels are making me thirsty.")
    orig := fingerprint.Hash(data)
    same := fingerprint.Hash(data)
    different := fingerprint.Hash([]byte("Hello, Newman"))
    if orig != same {
        t.Error("same data produced different hash")
    }
    if orig == different {
        t.Error("different data produced same hash")
    }
}
This test is better, because it doesn’t care about the implementation
details of Hash
, such as which algorithm is used. What it
cares about is that the same input to Hash
always gives the
same result, so it catches bugs like an unstable hash algorithm or even
a random result.
And, because a maliciously lazy implementation of Hash
might simply always return the same fixed value, the test also
requires that different data hashes to different values.
This isn’t completely bug-proof, of course. It’s possible, for
example, that Hash
always produces the same value unless
the input is exactly “Hello, Newman”. We could use fuzz testing to
tackle this, or just a table test with a bunch of different inputs.
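Here's what such a fuzz test might look like, using Go's built-in fuzzing support (a sketch; the seed corpus values are arbitrary):

func FuzzHashIsDeterministic(f *testing.F) {
    f.Add([]byte("These pretzels are making me thirsty."))
    f.Add([]byte("Hello, Newman"))
    f.Fuzz(func(t *testing.T, data []byte) {
        // The invariant under test: the same data always
        // hashes to the same value, whatever the input.
        if fingerprint.Hash(data) != fingerprint.Hash(data) {
            t.Errorf("inconsistent hash for %q", data)
        }
    })
}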
But you get the point. We made the test both less brittle and less feeble without adding much extra code; all it took was a little extra thinking.
In the next and final post in this series, we’ll look at speed and reliability of tests. If the test suite as a whole is too slow, it simply won’t be run, so it’s no use. On the other hand, if the tests are unreliable, and often fail even when the code is correct, they’re also useless in a different way. Let’s talk about how to fix that situation next time.