The Oracle Problem


This article surveys the current state of JavaScript tooling for a class of testing techniques which address something called “the oracle problem”. The class is coextensive with fuzz testing broadly defined, including mutational fuzzing, generative testing, property-based testing, model-based testing and mutation testing. While these ways of testing have found devotees in other ecosystems, none are much used or even known about in the JavaScript community. Not so long ago one could have said the same about unit testing. When will fuzz testing get its unit testing moment?

JavaScript libraries to enable these sorts of testing are still immature, but now reaching a point where it is possible for developers to try them out and come to their own opinions about the value of these testing approaches. It is time to start trialling these libraries on real applications so that we can better gauge the value of the techniques they make possible.

This article covers the same material as my recent talk at CityJSConf in slightly more depth. If you prefer video, check it out here:

Perambulatory prelude: what is an oracle?

We will begin with a little historical background on the relationship between oracles and testing. If you just want to jump straight into the tooling, feel free to skip this section. But you might find it interesting!

Greek oracles

Oracles had their heyday in ancient Greece. Shrines dotted throughout the rugged mountains of that country regularly concealed, as one scholar puts it, “frenzied women from whose lips the god speaks”. Kings and philosophers would go to these oracles in search of answers to unanswerable questions.

The Oracle of Apollo at Delphi by by Camilo Miola

The most famous of these women was of course the Oracle of Apollo at Delphi, who, as if anticipating future developments, presented herself as a sort of all-powerful computer, declaring:

I count the grains of sand on the shore and measure the sea

So the ancient historian Herodotus affirms.

Turing’s oracle

It took a couple of millennia for oracles to catch on in computer science. The idea of an oracle was introduced to computer science by Alan Turing in his dissertation of 1937 as “some unspecified means of solving number-theoretic problems”. More broadly, an oracle was a black-box source of information that would enable a Turing machine (an abstract computer) to solve a problem without further computation. Oracles proved a useful concept for advancing certain lines of argument in theoretical computer science.

Test oracles

In his 1978 paper “Theoretical and Empirical Studies of Program Testing”, William Howden observed that Turing’s concept had a particular relevance to software testing:

In order to use testing to validate a program it is necessary to assume the existence of a test oracle which can be used to check the correctness of test output.

In other words, it is essential to the nature of an empirical test that the data by which the computer decides whether a program has passed or failed is supplied by something other than the program itself. This data is usually supplied by a human, although it need not be. In abstract, it is an oracle. Whether you know it or not, you have probably written hundreds if not thousands of oracles in your career.

The oracle problem

“The Oracle at Delphi neither speaks nor conceals, but gives a sign”

So runs the famous saying of the Greek philosopher Heraclitus. Relying on a human to supply an oracle — whether priestess or programmer — has always been a risky business. Somehow simple human expression never quite manages to capture the divine truths it is aiming for. Perhaps you too have compared the complexity of reality with the simplicity of a test oracle…and wept.

The Latin reads: “Weeping, Heraclitus gazes on the games of our miserable life and continuously grieves the human condition”

When the Greeks wished to remind themselves of the dangers of placing too much faith in an oracle, they recalled the story of King Croesus. Croesus based an ill fated expedition against the Persians on one of the oracle’s “signs”. As Herodotus recounts:

The god had declared that if Croesus attacked the Persians he would bring down a mighty empire. After an answer like that, the wise thing would have been to send again to inquire which empire was meant, the Persian’s or his own.

Authors of software can learn something from this tale. If you want confidence in your system, one oracle is not enough.

Without knowing it, Heraclitus was mourning what has come to be known in the software testing literature as “the oracle problem”. The oracle problem recognises that humans are strictly speaking never sufficiently equipped to test their programs: reality is always more complicated than its representation in a test suite.

In some particularly extreme cases the problem is not simply assembling enough oracles, but coming up with any oracles at all.

Abandon hope all ye who enter here

For example, how can you test a legacy system whose code is indecipherable and whose documentation is missing? What oracle could you supply? Some types of program are so complicated that even though they may be well specified, generating expected output without the aid of the program itself is tricky: compilers, transpilers, encryption algorithms, obfuscators and some machine learning applications fall under this category. For most web development, however, our problem is not that we cannot come up with test cases; it’s that no matter how many cases we come up with, it is never enough. Our program will still have bugs. And there is a limit to what we can afford to maintain.

There are only two ways to solve the oracle problem. The first is to become a god and test all possible real-word conditions at once. The second is to conjure a demon to do this for you. Although daemons are not really gods, they can imitate gods better than anything we have yet discovered.

Did I say daemon? Of course I meant computer program.

Fuzz!

Automatically generating large numbers of oracles, sometimes randomly, sometimes with constraints, has a name: fuzz testing. There are many ways to make and work with fuzz:

  1. Cheat → artillery-plugin-fuzzer
  2. Fuzz blindly → zzuf, radamsa, sinkdweller, fuzz-me-maybe
  3. Evolve your fuzz → afl-fuzz, js-fuzz
  4. Fuzz with a grammar → dharma, octo, fuzzurjuzz
  5. Fuzz properties → test-check, babel-plugin-transform-flow-to-gen
  6. Fuzz models → fast-check
  7. Go meta! Fuzz test test suites → stryker

We shall consider these options in turn. For links to the libraries, please refer back to this list.

1. Cheat

The simplest way to get started with fuzzing, and possibly the most likely to yield a good return on investment (since the investment is so small) is to cheat.

Cheating means not generating oracles automatically but using a BIG LIST OF NAUGHTY STRINGS, crowdsourced test cases which are known to cause issues for insufficiently robust systems, for example:

  • 1;DROP TABLE users 1’; DROP TABLE users —
  • Ṱ̺̺̕o͞ ̷i̲̬͇̪͙n̝̗͕v̟̜̘̦͟o̶̙̰̠kè͚̮̺̪̹̱̤ ̖t̝͕̳̣̻̪͞h̼͓̲̦̳̘̲e͇̣̰̦̬͎ ̢̼̻̱̘h͚͎͙̜̣̲ͅi̦̲̣̰̤v̻͍e̺̭̳̪̰-m̢iͅn̖̺̞̲̯̰d̵̼̟͙̩̼̘̳ ̞̥̱̳̭r̛̗̘e͙p͠r̼̞̻̭̗e̺̠̣͟s̘͇̳͍̝͉e͉̥̯̞̲͚̬͜ǹ̬͎͎̟̖͇̤t͍̬̤͓̼̭͘ͅi̪̱n͠g̴͉ ͏͉ͅc̬̟h͡a̫̻̯͘o̫̟̖͍̙̝͉s̗̦̲.̨̹͈̣

Install artillery and artillery-plugin-fuzzer from NPM for a quick way to fire naughty strings at your application. Artillery expects a piece of configuration like the following:

This tells Artillery to fire a username and password at the session endpoint, taking the username from the list of naughty strings. Artillery will automatically do this for every string on the list.

Simply verifying that your server does not crash should provide some additional confidence. However, the best results with this type of fuzzing (and the next) are obtained if the code under test contains runtime assertions. Runtime assertions can check things like preconditions (does the state required for the current operation exist?), postconditions (did the operation do what was expected?) and invariants (some things that should never change). With runtime assertions in place, fuzzy tests will catch not just unhandled exceptions, but also broken business logic.

2. Fuzz blindly

For truly machine-generated oracles, turn to the command line utilities zzuf or radamsa. They are mutational fuzzers, which means they accept some input and output a fuzzy (randomly altered) version of it. They pay no heed to datastructures or encoding, which has benefits and downsides. The benefit is that they are more likely to produce edgier edge cases; the downside is that because most inputs will be rejected right away, they will need to run much longer before they catch anything (sometimes weeks…).

The following script will produce 100 variants of the string “hello”:

Note that each time we run the program we pass it a seed. Most fuzzers generate their randomisations in a deterministic fashion based on a given seed. That means if we find some output that triggers a bug we can easily recreate it using the seed.

Radamsa works similarly to zzuf but has a different fuzzing style. It tends to produce more wildly varying output more quickly. You can also find a Node wrapper for radamsa on NPM (sinkweller), together with a handy module (fuzz-me-maybe) which allows you define an insertion point for the fuzzed data other than your program’s main entrypoint:

Data passed into output will now be fuzzed before being passed to stream.write. Using this library, fuzzing can be turned on and off with environment variables.

Again, remember that this sort of fuzzing benefits greatly from being combined with runtime assertions.

3. Evolve your fuzz

What if instead of fuzzing blindly we had some way of automatically looking into our program and figuring out what types of input are most likely to cause interesting behaviours. We do, and its very simple: coverage analysis. A group of fuzzers sometimes known as “genetic fuzzers” keep track of the code coverage generated by each input and select the inputs which increase code coverage as material for further mutations. The oldest and most famous of these is American Fuzzy Lop (AFL), whose algorithm is sketched below:

Schematic of AFL algorithm

Because AFL relies on code instrumentation to track coverage, every language requires its own port of the program. The original AFL was written for C applications. It is possible to use this on JavaScript applications today by compiling Node from scratch with the AFL tools for C++, but this approach is not recommended. Fully functioning AFL ports exist for Go, Rust and Python: why not JavaScript?

Popularity of AFL ports in different languages

The best I have been able to find in JavaScript is connor4312/js-fuzz. This library is complete enough that it may function for some situations, but it needs more work and you are likely to experience bugs. Now would be a great time to get stuck in!

4. Fuzz with a grammar

You can pursue a more targeted fuzzing strategy by using a grammar to define precisely the shape of the fuzz that you want to see. Mozilla’s dharma is the simplest I have found. It is generative rather than mutational: you write a specification for the kind of data that you wish to produce and dharma will spew out data to satisfy the specification. It does not need any input data to “mutate”.

Writing a proper specification for an input type can be a little tricky but luckily various grammars already exist. Here is one for URLs, for example, that runs to 326 lines. Dharma can also be used to generate valid strings of JavaScript that you can eval in your application context. Check out this example of using dharma to generate valid invocations of a JavaScript function.

Now, that might seem like a long winded way to generate native JavaScript datatypes, and it is. Why not generate them directly inside the JavaScript context? Another library from Mozilla, octo does in fact include some in-JavaScript generators for things like URLs but is rather limited in scope. On the other hand, if the extent of your desire for grammar is to restrict output to certain specific JavaScript native types, and you are not concerned about things like forming valid URLs, you have a few more options. The simplest is fuzzur which will just take a JavaScript object and mutate it, preserving the types. More elaborate and potentially more useful is juzz, which promises to generate objects conforming to joi specifications. It works well for many Joi types but it is incomplete and not in active development. Why not pick it up? (Notice a pattern here!?)

Another group of testing libraries takes the idea of generating native JavaScript types much further, providing more fine-grained tools for doing so and including a harness for doing something called “property-based testing”. Let’s take a look at them now.

5. Fuzz properties

In this section we will look at property-based testing or generative testing (the two are synonymous), which sits somewhere between the fuzzing techniques described above and normal unit testing.

What is property-based testing? It is easiest to understand by looking at an example. Let’s take test-check. Here we are testing a function that is supposed to return the sum of two numbers or ten, whichever is higher. The developer accidentally commented out the functionality to make this work:

The test has two arguments: an array of generators and an assertion. Here we ask for two randomly generated integers with gen.int, use them to call our function and assert on the result. Plugins exist for integrating such tests into different testing libraries. Using Jest, for example, we can write the same test like this:

A wide variety of generators are available corresponding to different general types such as string, number, boolean, array, but also more specific types such as posInt, negInt, asciiString, JSON. It is also possible to define custom generators using combinations of these or any other generation tool (you could combine it with dharma, for example).

The output of this test looks like this:

The key fail records that the test failed with input [21, 8]. What about the key shrunk? Here we approach what makes property-based testing really powerful. A common feature of these libraries called “shrinking” means that when a failure is detected a search is performed to find the smallest value that will trigger the failure. In this case [3, 8] was found. Having a minimal test-case often makes it much easier to locate the source of the problem.

A potentially annoying feature of property-based testing is the need to define the input types for the functions under test. If you are using Flow or Typescript, this is particularly vexing as the types have already been defined once. Some headway has been made to tackle this problem. If you are using Flow, try babel-plugin-transform-flow-to-gen on NPM: it will enable you to convert your types into generators for test-check, although, be warned, the library is not actively developed. I am not aware of anything comparable for Typescript (again, if you are looking for a project…). If you are using Joi you can easily create a generator to do property-based testing of your endpoints with juzz.

The really difficult part of property-based testing, which takes practice, is finding good assertions. In particular, it is important to avoid simply reimplementing the code under test. The following are some categories of useful assertion that you can use for inspiration:

  • Boundaries: test that a result always falls within certain known bounds. This was our example with addTenMax.
  • Complex optimisations: if some code does something simple in a very complicated way to make it fast, you can include the simple implementation in your test and check the outputs match.
  • Complementary pairs: if you have pairs of functions such as encode and decode you can check that encoding then decoding reproduces the original input.
  • Idempotency: you can check that calling a function more than once does not alter its output (unless it is supposed to).
  • Commutativity: you can check that the order of arguments does not alter the result. For example, sum functions should behave this way.
  • Invariants: you can check invariants peculiar to your business logic.

If you are considering trying fuzzy testing for the first time, property-based testing is an excellent place to start. The feedback loop is much faster than for higher-level fuzzing and the JavaScript tooling is in a better state. Property based testing has also achieved a relatively high adoption rate in other ecosystems, notably Haskell and Clojure which include property based testing tools in their core libraries. Python, too, has a property-based testing library (hypothesis) that outstrips JavaScript’s offerings in terms of popularity and polish:

All the indications are that property-based testing has a lot to offer. The following claim from the Clojure documentation puts the case fairly bluntly:

Property-based, generative testing, as implemented for Clojure in test.check, has proved to be far more powerful than manually written tests.

Proved to be far more powerful? Have we been missing a trick in JavaScript land?

You will note that fast-check is the most recently updated JavaScript library, although the least popular. It is probably the most promising option at the moment and is particularly notable for including an exciting feature called “model-based testing”, which we shall look at now.

6. Model-based testing

Model-based testing is a form of generative testing designed for applications with complicated state, such as UIs. Many bugs in such systems require a complicated series of steps to be repeated in order to get the application into a faulty state. For example, a login page may function correctly most of the time, but may fail if the user takes a specific, complicated path through the application. To cover such scenarios, instead of generating inputs to functions, we generate operation sequences. We then check that the operations work as expected no matter what position in a sequence they hold.

Concretely, this approach requires us to define some operations we can do in the application along with their preconditions and postconditions. For example, the operation ‘click login’ might have a precondition ‘is on the login page’ and a postcondition, ‘is logged in’. The testing framework generates a valid random sequence of operations ensuring that the ‘click login’ operation always follows another operation whose postcondition is ‘is on the login page’. If any postcondition fails, the test has detected a bug.

Model-based testing algorithm (diagram by Nicolas Dubien)

Next time you are building user journeys with TestCafe or Cypress, spare a thought for fast-check, which can generate these for you!

7. Fuzz test your tests

The major obstacle standing against the adoption of all of these forms of oracle-problem-busting test techniques is the general lack of experience with them in JavaScript community. Without examples of successful applications of the techniques on JavaScript projects, and without battle-hardened tools, we can hardly be sure they are worth the time. Companies need to invest in both R and D in this area.

That is not all: the difficulty of knowing whether tests provide value actually cuts much deeper than the particular issue with fuzz testing.

The popularity of test coverage monitoring indicates a widespread desire to monitor how well an application is tested, yet coverage is a notoriously poor indicator of test quality. Coverage measures how much code was executed, but ignores whether that code was actually asserted against. The constant argy-bargy over what constitutes good testing, over when testing is too much, or not enough, or too integration-y, or too unit-y, reflects a widespread uncertainty concerning the impact any tests really have. Wouldn’t it be nice if we had some way to objectively measure — or test — the quality of our tests? A classic oracle problem…

Enter “mutation testing”. To explain what mutation testing is I cannot do better than quote the introduction on the website for stryker, currently the only available mutation testing toolkit for JavaScript:

Bugs, or mutants, are automatically inserted into your production code. Your tests are ran for each mutant. If your tests fail then the mutant is killed. If your tests passed, the mutant survived. The higher the percentage of mutants killed, the more effective your tests are…It’s really that simple.

This might sound a bit odd. Stryker will deliberately insert errors into your code, change return values, change one operator to a different one, etc…Think of it as mutational fuzz testing, but the input is your code! The idea — and it is a great one — is that if your tests continue to pass after your code is mutated, they are bad tests.

Stryker is, again, not a mature library, although it is being actively developed and has fairly considerable weight behind it. It is well worth a try.

I am seriously interested in hearing about your experiences if you managed to try any of these tools on a real project. If you merely enjoyed the read, please do clap 👏 , share, and follow me on twitter 🐦! Thank you!

Further Reading

Fuzz testing is a massive topic and I will not attempt a bibliography. The documentation of the various libraries discussed here is a good place to start. For those who want to go deeper here is one recommendation: The Fuzzing Book.


Written by Sam GalsonJune 1st, 2019


Share this article