The Oracle Problem
by Sam Galson • June 10th, 2019 • 13min
This article covers the same material as my recent talk at CityJSConf in slightly more depth. If you prefer video, check it out here:
Perambulatory prelude: what is an oracle?
We will begin with a little historical background on the relationship between oracles and testing. If you just want to jump straight into the tooling, feel free to skip this section. But you might find it interesting!
Oracles had their heyday in ancient Greece. Shrines dotted throughout the rugged mountains of that country regularly concealed, as one scholar puts it, “frenzied women from whose lips the god speaks”. Kings and philosophers would go to these oracles in search of answers to unanswerable questions.
The most famous of these women was of course the Oracle of Apollo at Delphi, who, as if anticipating future developments, presented herself as a sort of all-powerful computer, declaring:
I count the grains of sand on the shore and measure the sea
So the ancient historian Herodotus affirms.
It took a couple of millennia for oracles to catch on in computer science. The idea of an oracle was introduced to computer science by Alan Turing in his dissertation of 1937 as “some unspecified means of solving number-theoretic problems”. More broadly, an oracle was a black-box source of information that would enable a Turing machine (an abstract computer) to solve a problem without further computation. Oracles proved a useful concept for advancing certain lines of argument in theoretical computer science.
In his 1978 paper “Theoretical and Empirical Studies of Program Testing”, William Howden observed that Turing’s concept had a particular relevance to software testing:
In order to use testing to validate a program it is necessary to assume the existence of a test oracle which can be used to check the correctness of test output.
In other words, it is essential to the nature of an empirical test that the data by which the computer decides whether a program has passed or failed is supplied by something other than the program itself. This data is usually supplied by a human, although it need not be. In abstract, it is an oracle. Whether you know it or not, you have probably written hundreds if not thousands of oracles in your career.
The oracle problem
“The Oracle at Delphi neither speaks nor conceals, but gives a sign”
So runs the famous saying of the Greek philosopher Heraclitus. Relying on a human to supply an oracle — whether priestess or programmer — has always been a risky business. Somehow simple human expression never quite manages to capture the divine truths it is aiming for. Perhaps you too have compared the complexity of reality with the simplicity of a test oracle…and wept.
When the Greeks wished to remind themselves of the dangers of placing too much faith in an oracle, they recalled the story of King Croesus. Croesus based an ill fated expedition against the Persians on one of the oracle’s “signs”. As Herodotus recounts:
The god had declared that if Croesus attacked the Persians he would bring down a mighty empire. After an answer like that, the wise thing would have been to send again to inquire which empire was meant, the Persian’s or his own.
Authors of software can learn something from this tale. If you want confidence in your system, one oracle is not enough.
Without knowing it, Heraclitus was mourning what has come to be known in the software testing literature as “the oracle problem”. The oracle problem recognises that humans are strictly speaking never sufficiently equipped to test their programs: reality is always more complicated than its representation in a test suite.
In some particularly extreme cases the problem is not simply assembling enough oracles, but coming up with any oracles at all.
For example, how can you test a legacy system whose code is indecipherable and whose documentation is missing? What oracle could you supply? Some types of program are so complicated that even though they may be well specified, generating expected output without the aid of the program itself is tricky: compilers, transpilers, encryption algorithms, obfuscators and some machine learning applications fall under this category. For most web development, however, our problem is not that we cannot come up with test cases; it’s that no matter how many cases we come up with, it is never enough. Our program will still have bugs. And there is a limit to what we can afford to maintain.
There are only two ways to solve the oracle problem. The first is to become a god and test all possible real-word conditions at once. The second is to conjure a demon to do this for you. Although daemons are not really gods, they can imitate gods better than anything we have yet discovered.
Did I say daemon? Of course I meant computer program.
Automatically generating large numbers of oracles, sometimes randomly, sometimes with constraints, has a name: fuzz testing. There are many ways to make and work with fuzz:
- Cheat → artillery-plugin-fuzzer
- Fuzz blindly → zzuf, radamsa, sinkdweller, fuzz-me-maybe
- Evolve your fuzz → afl-fuzz, js-fuzz
- Fuzz with a grammar → dharma, octo, fuzzur, juzz
- Fuzz properties → test-check, babel-plugin-transform-flow-to-gen
- Fuzz models → fast-check
- Go meta! Fuzz test test suites → stryker
We shall consider these options in turn. For links to the libraries, please refer back to this list.
The simplest way to get started with fuzzing, and possibly the most likely to yield a good return on investment (since the investment is so small) is to cheat.
Cheating means not generating oracles automatically but using a BIG LIST OF NAUGHTY STRINGS, crowdsourced test cases which are known to cause issues for insufficiently robust systems, for example:
- 1;DROP TABLE users 1’; DROP TABLE users —
- Ṱ̺̺̕o͞ ̷i̲̬͇̪͙n̝̗͕v̟̜̘̦͟o̶̙̰̠kè͚̮̺̪̹̱̤ ̖t̝͕̳̣̻̪͞h̼͓̲̦̳̘̲e͇̣̰̦̬͎ ̢̼̻̱̘h͚͎͙̜̣̲ͅi̦̲̣̰̤v̻͍e̺̭̳̪̰-m̢iͅn̖̺̞̲̯̰d̵̼̟͙̩̼̘̳ ̞̥̱̳̭r̛̗̘e͙p͠r̼̞̻̭̗e̺̠̣͟s̘͇̳͍̝͉e͉̥̯̞̲͚̬͜ǹ̬͎͎̟̖͇̤t͍̬̤͓̼̭͘ͅi̪̱n͠g̴͉ ͏͉ͅc̬̟h͡a̫̻̯͘o̫̟̖͍̙̝͉s̗̦̲.̨̹͈̣
Install artillery and artillery-plugin-fuzzer from NPM for a quick way to fire naughty strings at your application. Artillery expects a piece of configuration like the following:
This tells Artillery to fire a username and password at the session endpoint, taking the username from the list of naughty strings. Artillery will automatically do this for every string on the list.
Simply verifying that your server does not crash should provide some additional confidence. However, the best results with this type of fuzzing (and the next) are obtained if the code under test contains runtime assertions. Runtime assertions can check things like preconditions (does the state required for the current operation exist?), postconditions (did the operation do what was expected?) and invariants (some things that should never change). With runtime assertions in place, fuzzy tests will catch not just unhandled exceptions, but also broken business logic.
2. Fuzz blindly
For truly machine-generated oracles, turn to the command line utilities zzuf or radamsa. They are mutational fuzzers, which means they accept some input and output a fuzzy (randomly altered) version of it. They pay no heed to datastructures or encoding, which has benefits and downsides. The benefit is that they are more likely to produce edgier edge cases; the downside is that because most inputs will be rejected right away, they will need to run much longer before they catch anything (sometimes weeks…).
The following script will produce 100 variants of the string “hello”:
Note that each time we run the program we pass it a seed. Most fuzzers generate their randomisations in a deterministic fashion based on a given seed. That means if we find some output that triggers a bug we can easily recreate it using the seed.
Radamsa works similarly to zzuf but has a different fuzzing style. It tends to produce more wildly varying output more quickly. You can also find a Node wrapper for radamsa on NPM (sinkweller), together with a handy module (fuzz-me-maybe) which allows you define an insertion point for the fuzzed data other than your program’s main entrypoint:
Data passed into output will now be fuzzed before being passed to stream.write. Using this library, fuzzing can be turned on and off with environment variables.
Again, remember that this sort of fuzzing benefits greatly from being combined with runtime assertions.
3. Evolve your fuzz
What if instead of fuzzing blindly we had some way of automatically looking into our program and figuring out what types of input are most likely to cause interesting behaviours. We do, and its very simple: coverage analysis. A group of fuzzers sometimes known as “genetic fuzzers” keep track of the code coverage generated by each input and select the inputs which increase code coverage as material for further mutations. The oldest and most famous of these is American Fuzzy Lop (AFL), whose algorithm is sketched below:
4. Fuzz with a grammar
You can pursue a more targeted fuzzing strategy by using a grammar to define precisely the shape of the fuzz that you want to see. Mozilla’s dharma is the simplest I have found. It is generative rather than mutational: you write a specification for the kind of data that you wish to produce and dharma will spew out data to satisfy the specification. It does not need any input data to “mutate”.
5. Fuzz properties
In this section we will look at property-based testing or generative testing (the two are synonymous), which sits somewhere between the fuzzing techniques described above and normal unit testing.
What is property-based testing? It is easiest to understand by looking at an example. Let’s take test-check. Here we are testing a function that is supposed to return the sum of two numbers or ten, whichever is higher. The developer accidentally commented out the functionality to make this work:
The test has two arguments: an array of generators and an assertion. Here we ask for two randomly generated integers with gen.int, use them to call our function and assert on the result. Plugins exist for integrating such tests into different testing libraries. Using Jest, for example, we can write the same test like this:
A wide variety of generators are available corresponding to different general types such as string, number, boolean, array, but also more specific types such as posInt, negInt, asciiString, JSON. It is also possible to define custom generators using combinations of these or any other generation tool (you could combine it with dharma, for example).
The output of this test looks like this:
The key fail records that the test failed with input [21, 8]. What about the key shrunk? Here we approach what makes property-based testing really powerful. A common feature of these libraries called “shrinking” means that when a failure is detected a search is performed to find the smallest value that will trigger the failure. In this case [3, 8] was found. Having a minimal test-case often makes it much easier to locate the source of the problem.
A potentially annoying feature of property-based testing is the need to define the input types for the functions under test. If you are using Flow or Typescript, this is particularly vexing as the types have already been defined once. Some headway has been made to tackle this problem. If you are using Flow, try babel-plugin-transform-flow-to-gen on NPM: it will enable you to convert your types into generators for test-check, although, be warned, the library is not actively developed. I am not aware of anything comparable for Typescript (again, if you are looking for a project…). If you are using Joi you can easily create a generator to do property-based testing of your endpoints with juzz.
The really difficult part of property-based testing, which takes practice, is finding good assertions. In particular, it is important to avoid simply reimplementing the code under test. The following are some categories of useful assertion that you can use for inspiration:
- Boundaries: test that a result always falls within certain known bounds. This was our example with addTenMax.
- Complex optimisations: if some code does something simple in a very complicated way to make it fast, you can include the simple implementation in your test and check the outputs match.
- Complementary pairs: if you have pairs of functions such as encode and decode you can check that encoding then decoding reproduces the original input.
- Idempotency: you can check that calling a function more than once does not alter its output (unless it is supposed to).
- Commutativity: you can check that the order of arguments does not alter the result. For example, sum functions should behave this way.
- Invariants: you can check invariants peculiar to your business logic.
All the indications are that property-based testing has a lot to offer. The following claim from the Clojure documentation puts the case fairly bluntly:
Property-based, generative testing, as implemented for Clojure in test.check, has proved to be far more powerful than manually written tests.
6. Model-based testing
Model-based testing is a form of generative testing designed for applications with complicated state, such as UIs. Many bugs in such systems require a complicated series of steps to be repeated in order to get the application into a faulty state. For example, a login page may function correctly most of the time, but may fail if the user takes a specific, complicated path through the application. To cover such scenarios, instead of generating inputs to functions, we generate operation sequences. We then check that the operations work as expected no matter what position in a sequence they hold.
Concretely, this approach requires us to define some operations we can do in the application along with their preconditions and postconditions. For example, the operation ‘click login’ might have a precondition ‘is on the login page’ and a postcondition, ‘is logged in’. The testing framework generates a valid random sequence of operations ensuring that the ‘click login’ operation always follows another operation whose postcondition is ‘is on the login page’. If any postcondition fails, the test has detected a bug.
Next time you are building user journeys with TestCafe or Cypress, spare a thought for fast-check, which can generate these for you!
7. Fuzz test your tests
That is not all: the difficulty of knowing whether tests provide value actually cuts much deeper than the particular issue with fuzz testing.
The popularity of test coverage monitoring indicates a widespread desire to monitor how well an application is tested, yet coverage is a notoriously poor indicator of test quality. Coverage measures how much code was executed, but ignores whether that code was actually asserted against. The constant argy-bargy over what constitutes good testing, over when testing is too much, or not enough, or too integration-y, or too unit-y, reflects a widespread uncertainty concerning the impact any tests really have. Wouldn’t it be nice if we had some way to objectively measure — or test — the quality of our tests? A classic oracle problem…
Bugs, or mutants, are automatically inserted into your production code. Your tests are ran for each mutant. If your tests fail then the mutant is killed. If your tests passed, the mutant survived. The higher the percentage of mutants killed, the more effective your tests are…It’s really that simple.
This might sound a bit odd. Stryker will deliberately insert errors into your code, change return values, change one operator to a different one, etc…Think of it as mutational fuzz testing, but the input is your code! The idea — and it is a great one — is that if your tests continue to pass after your code is mutated, they are bad tests.
Stryker is, again, not a mature library, although it is being actively developed and has fairly considerable weight behind it. It is well worth a try.
I am seriously interested in hearing about your experiences if you managed to try any of these tools on a real project. If you merely enjoyed the read, please do clap 👏 , share, and follow me on twitter 🐦! Thank you!
Fuzz testing is a massive topic and I will not attempt a bibliography. The documentation of the various libraries discussed here is a good place to start. For those who want to go deeper here is one recommendation: The Fuzzing Book.
Written by Sam Galson • June 1st, 2019
Share this article