Finding bugs with fuzzing

Finding bugs with fuzzing

There are bugs in your program; I guarantee it.

We can’t prove anything in science. But the best that scientists can do is fail to disprove things while pointing to how hard they tried.

—Richard Dawkins, “The Greatest Show on Earth”

This is the final article in a series of four about fuzzing in Go:

  1. Random testing in Go
  2. Fuzz tests in Go
  3. Writing a Go fuzz target
  4. Finding bugs with fuzzing

In the earlier parts of this series, we introduced the idea of fuzz testing, showed how it works in Go, and wrote a simple fuzz test aimed at detecting bugs in a function named FirstRune (see Part 3 for details). Here it is:

func FuzzFirstRune(f *testing.F) {
    f.Add("Hello")
    f.Add("world")
    f.Fuzz(func(t *testing.T, s string) {
        got := runes.FirstRune(s)
        want, _ := utf8.DecodeRuneInString(s)
        if want == utf8.RuneError {
            t.Skip() // don't bother testing invalid runes
        }
        if want != got {
            t.Errorf("given %q (0x%[1]x): want '%c' (0x%[2]x)",
                s, want)
            t.Errorf("got '%c' (0x%[1]x)", got)
        }
    })
}

(Listing runes/1)

As usual when developing a program guided by tests, we wrote a cheerfully rubbish implementation of FirstRune that does basically nothing, just to validate that our test is useful:

func FirstRune(s string) rune {
    return 0
}

(Listing runes/1)

If the test didn’t fail with this function, we might start to worry a bit about whether it would detect any bugs. But it does indeed report, reassuringly:

go test -fuzz .

fuzz: elapsed: 0s, gathering baseline coverage: 0/2 completed
failure while testing seed corpus entry: FuzzFirstRune/seed#1
fuzz: elapsed: 0s, gathering baseline coverage: 0/2 completed
--- FAIL: FuzzFirstRune (0.02s)
    --- FAIL: FuzzFirstRune (0.00s)
        runes_test.go:22: given "world" (0x776f726c64): want 'w'
        (0x77)
        runes_test.go:24: got '' (0x0)

In other words, asking for the first rune in the string "world" gives us the rune value 0, which is incorrect: it should be 'w'. Result, happiness.

Now that we’ve validated our bug detector, we can go ahead and write a slightly more plausible version of FirstRune (though, as we’ll see, there is still a problem lurking):

func FirstRune(s string) rune {
    return rune(s[0])
}

(Listing runes/2)

We use the square bracket slice-index notation (s[0]) to extract the first byte of s, and we return it having converted it explicitly to the rune type.

Though plausible, this is still incorrect, and you can see why, can’t you? This is the kind of code that someone might write, for example, if they didn’t know about the distinction between bytes and runes in Go that we talked about in Writing a Go fuzz target.

It might even work some of the time, but only by accident, and only when our input strings don’t contain any multi-byte runes.

Let’s see how far we get this time:

go test -fuzz .

fuzz: elapsed: 0s, gathering baseline coverage: 0/2 completed
fuzz: elapsed: 0s, gathering baseline coverage: 2/2 completed,
    now fuzzing with 8 workers
fuzz: minimizing 27-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzFirstRune (0.16s)
    --- FAIL: FuzzFirstRune (0.00s)
        testing.go:1356: panic: runtime error: index out of range
        [0] with length 0
        [stack trace omitted]

It seems that the two seed corpus examples have now been successfully checked (2/2 completed), and the fuzzer announces now fuzzing with 8 workers, meaning that all eight CPU cores are concurrently generating inputs for the fuzz target. The more cores we have available, the more fuzzing we can do in parallel.

When we’re reporting a bug we found manually, it’s always worth taking a little time to find the minimal input that still reproduces the bug. This will help the maintainers, because they can ignore the irrelevant parts of the input and focus on the part that triggers the problem. By this shrinking process, we eliminate any part of the input that doesn’t affect the outcome.

The fuzzer does the same thing. If it finds some input that fails the test, it tries smaller and smaller versions of it, looking for the minimal failing input:

fuzz: minimizing 27-byte failing input file
fuzz: elapsed: 0s, minimizing

In this example, we can see that it found a 27-byte input string that caused the fuzz target to fail. It then tries smaller and smaller versions of the input to see if they still fail. This process continues until it has found the minimal input that causes the fuzz target to fail.

And here it is:

--- FAIL: FuzzFirstRune (0.16s)
    --- FAIL: FuzzFirstRune (0.00s)
        testing.go:1356: panic: runtime error: index out of range
        [0] with length 0

It seems that, in this case, the fuzzer managed to shrink the failing input to a completely empty string (length 0). This is rather puzzling. What’s gone on here?

In fact, what’s probably happened is that the fuzzer generated a random 27-byte string whose first rune was multi-byte, causing the function to return the wrong answer. As the fuzzer tried shorter and shorter versions of the string, the same “wrong result” failure kept happening.

Finally, when it reduced the string to length zero, it got a different failure: an “index out of range” panic. It so happens that FirstRune doesn’t cope well with empty input strings:

func FirstRune(s string) rune {
    return rune(s[0]) // oops, slice overrun
}

(Listing runes/2)

As I’m sure you spotted right away, the slice index expression s[0] will panic when s is empty, and indeed that’s what we’re seeing. Let’s fix that:

func FirstRune(s string) rune {
    if s == "" {
        return utf8.RuneError
    }
    return rune(s[0])
}

(Listing runes/3)

It seems reasonable to follow the example set by the standard library, which is to return utf8.RuneError in this case. Now the function won’t panic when given an empty input.

As you’ll recall from our earlier example, when the fuzzer finds a failing input, it automatically saves it to the testdata folder. This becomes part of the corpus, and even when we run go test without the -fuzz flag, this case will still be checked.

Let’s try that now, using the -v flag to go test to see which tests are actually being run, and whether our fuzz-generated cases are among them:

go test -v

=== RUN   FuzzFirstRune
=== RUN   FuzzFirstRune/seed#0
=== RUN   FuzzFirstRune/seed#1
=== RUN   FuzzFirstRune/xxx
    fuzz_test.go:61:
--- PASS: FuzzFirstRune (0.00s)
    --- PASS: FuzzFirstRune/seed#0 (0.00s)
    --- PASS: FuzzFirstRune/seed#1 (0.00s)
    --- SKIP: FuzzFirstRune/xxx (0.00s)

Again, xxx is actually a much longer name, but I’ve spared you that. We can see, though, that the two manually-added seeds are being checked (seed#0 and seed#1), and then we come to the generated case xxx, which we know is the empty string.

The result of that is neither pass nor fail, but SKIP. Why? Remember when we first wrote the test, we said that we’ll skip inputs for which DecodeRuneInString returns RuneError. That’s true for the empty string, which is why this case shows as skipped.

Detecting more subtle bugs

The fuzzer has already proved its worth, by detecting a panic in the FirstRune function when it’s called with the empty string as input. But are there still more bugs lurking?

In the previous example, we saw that the fuzzer found a 27-byte string that caused a test failure. That can’t be due to the slice overrun bug, so there must still be another bug remaining in FirstRune.

Let’s see if we can find it:

go test -fuzz .

fuzz: elapsed: 0s, gathering baseline coverage: 0/3 completed
fuzz: elapsed: 0s, gathering baseline coverage: 3/3 completed, now
fuzzing with 8 workers
fuzz: minimizing 32-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzFirstRune (0.13s)
    --- FAIL: FuzzFirstRune (0.00s)
        fuzz_test.go:22: given "ʭ" (0xcaad): want 'ʭ' (0x2ad),
        got 'Ê' (0xca)

Again, interpreting this output, we can see that the fuzzer found a 32-byte failing input, which it then managed to shrink to just 2 bytes:

given "ʭ" (0xcaad)

That is, this input string consists of the rune ʭ, which in UTF-8 is encoded as two bytes, 0xca and 0xad. If FirstRune had correctly decoded this string, it would have returned the rune ʭ, the same result as the oracle.

Instead, it mistakenly interpreted the first byte of the string (0xca) as a single-byte rune, in this case the letter Ê.

A victory for fuzzing! Even if we didn’t know about the distinction between runes and bytes in Go, and even if we might not have thought of using a string like “ʭ” as an example input, we were still able to find the bug.

Fixing the implementation

How should we have implemented FirstRune correctly, then? Well, that’s not really important for the discussion of fuzz testing, but I don’t like to leave anyone in suspense. Let’s see if we can fix this bug.

There are a variety of ways we could write FirstRune correctly, including the rather dull one of simply calling utf8.DecodeRuneInString, as the test does.

But here’s a slightly more interesting version, just for fun:

func FirstRune(s string) rune {
    for _, r := range s {
        return r
    }
    return utf8.RuneError
}

(Listing runes/4)

Why does this work? As you probably know, the range operator iterates over a string by runes, not by bytes. So, on each successive execution of this loop, r will be the next rune in the string, starting with the first.

But on this very first iteration, we hit the return statement inside the loop body, and we return the first rune straight away. After all, that’s the right answer!

In the case where s is empty, the for loop will execute zero times; that is to say, not at all. We’ll go straight to the end of the function and return the value 0.

I think this is correct, but let’s see if the fuzzer agrees. Or, rather, we’ll give the fuzzer a chance to find some counterexamples, and see if it succeeds:

go test -fuzz .

fuzz: elapsed: 0s, gathering baseline coverage: 0/4 completed
fuzz: elapsed: 0s, gathering baseline coverage: 4/4 completed, now
fuzzing with 8 workers
fuzz: elapsed: 3s, execs: 130517 (43494/sec), new interesting: 5
(total: 13)
fuzz: elapsed: 6s, execs: 268452 (45962/sec), new interesting: 5
(total: 13)
fuzz: elapsed: 9s, execs: 396881 (42797/sec), new interesting: 5
(total: 13)
...

And, in fact, I left this to run for a minute or two, but it didn’t find any more failing inputs. That doesn’t guarantee that there aren’t any, of course. Even the fuzzer can’t guarantee that.

The aim of fuzz testing, in fact, is to fail to disprove the correctness of a function such as FirstRune for some reasonable period of time, while pointing to how hard we tried.

One interesting thing you might have noticed about this output is the word “interesting”. What does that mean?

fuzz: elapsed: 3s, execs: 130517 (43494/sec), new interesting: 5
(total: 13)

It turns out that the fuzzer does more than just generate random inputs. It also looks at which code paths are executed by those inputs. It considers an input “interesting” if it causes a new code path to be executed in the system under test.

After all, if there’s a code path that’s executed by no other test case, bugs could definitely be lurking there. So the fuzzer tries to discover inputs that cause new code paths to be executed. If it finds any, it uses them as a basis to generate similar inputs, hoping to find one that triggers a failure.

A question that might occur to you is how the fuzzer knows which code paths in your system are executed by any given test case. The answer is that it uses the same runtime instrumentation with which Go calculates your test coverage, and if you’d like to know more about that, go straight to the Happy Fun Books store (run, don’t walk) and buy The Power of Go: Tests.

You can do this: surviving day one

You can do this: surviving day one

Writing a Go fuzz target

Writing a Go fuzz target