
With the advent of coverage-based fuzzing, and how well supported it is in Go, what am I missing by not using one of the property-based testing libraries?

https://www.tedinski.com/2018/12/11/fuzzing-and-property-tes...

Like, with the below fuzz test, and the corresponding invariant checks, isn't this all but equivalent to property tests?

https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...

https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...


hyperpape
The distinction between property based testing and fuzzing is basically just a rough cluster of vibes. It describes a real difference, but the borders are pretty vague and precisely deciding which things are fuzzing and which are PBT isn’t really that critical.

- Quick running tests, detailed assertions —> PBT

- Longer tests, just looking for a crash —> fuzzing.

- In between, who knows?

https://hypothesis.works/articles/what-is-property-based-tes...

ncruces OP
Interesting read. I guess I'm mostly interested in number 2 here:

    Under this point of view, a property-based testing library is really two parts:
    1. A fuzzer.
    2. A library of tools for making it easy to construct property-based tests using that fuzzer.

What should I (we?) be building on top of Go's fuzzer to make it easier to construct property-based tests?

My current strategy is to interpret a byte stream as a sequence of "commands", let those be transitions that drive my "state machine", then test all invariants I can think of at every step.
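
Concretely, something like this (a stand-in sketch: the set here is just a map so it compiles, and Add/Remove/Contains/CheckInvariants are illustrative names, not the linked repo's API):

    package aa_test

    import "testing"

    // set stands in for the real data structure (e.g. an AA tree); a real
    // implementation would verify its balance/level rules in CheckInvariants.
    type set map[byte]bool

    func (s set) Add(k byte)             { s[k] = true }
    func (s set) Remove(k byte)          { delete(s, k) }
    func (s set) Contains(k byte) bool   { return s[k] }
    func (s set) CheckInvariants() error { return nil } // real structure: check shape here

    func FuzzSetCommands(f *testing.F) {
        f.Fuzz(func(t *testing.T, data []byte) {
            s := set{}
            for len(data) >= 2 {
                op, key := data[0], data[1]
                data = data[2:]
                switch op % 2 {
                case 0: // command: insert
                    s.Add(key)
                    if !s.Contains(key) {
                        t.Fatalf("added %d but Contains reports false", key)
                    }
                default: // command: delete
                    s.Remove(key)
                    if s.Contains(key) {
                        t.Fatalf("removed %d but Contains reports true", key)
                    }
                }
                if err := s.CheckInvariants(); err != nil {
                    t.Fatal(err)
                }
            }
        })
    }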

PartiallyTyped
I don’t know much about that specific fuzzer, but all you need is (a) consistent and controlled randomness, (b) combinators, and (c) reducers.

What you describe (commands) is commonly used with model based testing and distributed systems.

westurner
TLA+, Formal Methods in Python: FizzBee, Nagini, Deal-solver, Dafny: https://www.hackerneue.com/item?id=39938759 :

> Python is Turing complete, but does [TLA,] need to be? Is there an in-Python syntax that can be expanded in place by tooling for pretty diffs; How much overlap between existing runtime check DbC decorators and these modeling primitives and feature extraction transforms should there be? (In order to: minimize cognitive overload for human review; sufficiently describe the domains, ranges, complexity costs, inconstant timings, and the necessary and also the possible outcomes given concurrency,)

From "S2n-TLS – A C99 implementation of the TLS/SSL protocol" https://www.hackerneue.com/item?id=38510025 :

> But formal methods (and TLA+ for distributed computation) don't eliminate side channels. [in CPUs e.g. with branch prediction, GPUs, TPUs/NPUs, Hypervisors, OS schedulers, IPC,]

Still, though: coverage-based fuzzing:

From https://www.hackerneue.com/item?id=30786239 :

> OSS-Fuzz runs ClusterFuzz[Lite?] for many open source repos and feeds OSV OpenSSF Vulnerability Format: https://github.com/google/osv#current-data-sources

From "Automated Unit Test Improvement using Large Language Models at Meta" https://www.hackerneue.com/item?id=39416628 :

> "Fuzz target generation using LLMs" (2023), OSSF//fuzz-introspector*

> gh topic: https://github.com/topics/coverage-guided-fuzzing

The Fuzzing computational task is similar to the Genetic Algorithm computational task, in that both explore combinatorial search spaces of potentially infinite degree; thus there is a need for parallelism, and thus a need for partitioning for distributed computation. (But there is no computational oracle to predict that any particular sequence of combinations of inputs under test will deterministically halt on any of the distributed workers, so second-order methods like gradient descent help to skip over apparently desolate territory when the error hasn't changed in a while.)

The Fuzzing computational task: partition the set of all combinations of inputs for distributed execution with execute-once or consensus to resolve redundant results.

DbC (Design-by-Contract) patterns include Preconditions and Postconditions (which include tests of Invariance).

We test Preconditions to exclude Inputs that do not meet the specified Ranges, and we verify the Ranges of Outputs in Postconditions.

We test Invariance to verify that there haven't been side-effects in other scopes; that variables and their attributes haven't changed after the function - the Command - returns.
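
Roughly, in code (a sketch only; the Account type and the Withdraw command below are made up for illustration):

    package dbc

    import "fmt"

    type Account struct {
        Owner   string
        Balance int
    }

    // Withdraw debits amount, with contract-style checks written inline.
    func Withdraw(a *Account, amount int) error {
        // Precondition: reject Inputs that do not meet the specified Ranges.
        if amount <= 0 || amount > a.Balance {
            return fmt.Errorf("precondition violated: need 0 < amount <= balance")
        }
        oldOwner, oldBalance := a.Owner, a.Balance // snapshots for the checks below

        a.Balance -= amount

        // Postcondition: verify the Range/relationship of the Outputs.
        if a.Balance != oldBalance-amount || a.Balance < 0 {
            return fmt.Errorf("postcondition violated")
        }
        // Invariance: verify there were no side effects on state the
        // command was not supposed to touch.
        if a.Owner != oldOwner {
            return fmt.Errorf("invariance violated: owner changed")
        }
        return nil
    }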

DbC: https://en.wikipedia.org/wiki/Design_by_contract :

> Design by contract has its roots in work on formal verification, formal specification and Hoare logic.

TLA+ > Language: https://en.wikipedia.org/wiki/TLA%2B#Language

Formal verification: https://en.wikipedia.org/wiki/Formal_verification

From https://www.hackerneue.com/item?id=38138319 :

> Property testing: https://en.wikipedia.org/wiki/Property_testing

> awesome-python-testing#property-based-testing: https://github.com/cleder/awesome-python-testing#property-ba...

> Fuzzing: https://en.wikipedia.org/wiki/Fuzzing

Software testing > Categorization > [..., Property testing, Metamorphic testing] https://en.wikipedia.org/wiki/Software_testing#Categorizatio...

--

From https://www.hackerneue.com/item?id=28494885#28513982 :

> https://github.com/dafny-lang/dafny #read-more

> Dafny Cheat Sheet: https://docs.google.com/document/d/1kz5_yqzhrEyXII96eCF1YoHZ...

> Looks like there's a Haskell-to-Dafny converter.

haskell2dafny: https://gitlab.doc.ic.ac.uk/dcw/haskell-subset-to-dafny-tran...

--

Controlled randomness: tests of randomness, random uniform not random normal, rngd:

From https://www.hackerneue.com/item?id=40630177 :

> google/paranoid_crypto.lib.randomness_tests: https://github.com/google/paranoid_crypto/tree/main/paranoid... docs: https://github.com/google/paranoid_crypto/blob/main/docs/ran...

randomdata
> What should I (we?) be building on top of Go's fuzzer to make it easier to construct property-based tests?

What's the intent behind building on top of Go's fuzzer, as opposed to an approach like the one testing/quick takes?

ncruces OP
The testing/quick package is frozen and is not accepting new features.

Also: https://www.hackerneue.com/item?id=30385195

But it occurs to me that I could try to build on the same API and compare them for effectiveness.
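
For comparison, a property written against the frozen testing/quick API looks something like this (the reverse helper is just a made-up example):

    package quickdemo

    import (
        "bytes"
        "testing"
        "testing/quick"
    )

    // reverse is a made-up function under test.
    func reverse(s []byte) []byte {
        out := make([]byte, len(s))
        for i, b := range s {
            out[len(s)-1-i] = b
        }
        return out
    }

    // Property: reversing twice gives back the input.
    func TestReverseRoundTrip(t *testing.T) {
        prop := func(s []byte) bool {
            return bytes.Equal(reverse(reverse(s)), s)
        }
        if err := quick.Check(prop, nil); err != nil {
            t.Error(err)
        }
    }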

randomdata
> The testing/quick package is frozen and is not accepting new features.

Indeed. And respect to software that knows when it is complete, but is there some relevance to this here?

> Also: https://www.hackerneue.com/item?id=30385195

While the distinction between fuzzing and PBT is, er, fuzzy, what is unquestionable is that PBT is not meant to be coverage driven. It is a means to document behaviour for other developers in a way that is, in some cases, more descriptive than providing a fixed set of examples or other "traditional" ways of providing such documentation. Seeking coverage and "interesting" cases, rather than providing documentation, is decidedly the role of fuzzing.

yen223
Unit tests are fuzzers if your system is chaotic enough.

kccqzy
You can totally combine coverage-based fuzzing with property-based tests. When I was at Google, I really enjoyed their internal tooling for combining both. You simply write a property-based test as usual, but when it comes to execution, the testing framework compiles your test in a special way to get coverage and then adjusts the random input to hit increased coverage. Of course they run the test across a cluster of machines completely automatically.

Traditional property-based testing is implemented simply as a library, so those libraries don't necessarily have coverage information to guide their random input generation.
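
In Go, a rough analogue of that (with the built-in fuzzer only, not Google's internal tooling) is to assert the property directly inside the fuzz target and let coverage feedback steer input generation; for example, a round-trip property over the standard library's strconv:

    package example

    import (
        "strconv"
        "testing"
    )

    // Property: Unquote(Quote(s)) == s for every string s.
    func FuzzQuoteRoundTrip(f *testing.F) {
        f.Add("hello\x00world") // seed corpus entry
        f.Fuzz(func(t *testing.T, s string) {
            got, err := strconv.Unquote(strconv.Quote(s))
            if err != nil {
                t.Fatalf("Unquote failed: %v", err)
            }
            if got != s {
                t.Fatalf("round trip changed %q to %q", s, got)
            }
        })
    }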

AlexErrant
You're asserting properties, so IMO this meets the definition of PBTs ("every node of level greater than one has two children").

However, depending on the lib, you can get some nice quality of life improvements. One "nice to have" is shrinking. See the "Shrinking" section here https://tech.fpcomplete.com/blog/quickcheck-hedgehog-validit...

Having combinators to compose generators is also great.

Libs may also have a known set of "bad" values that cause exceptional behavior.

ncruces OP
The Go fuzzer, when it finds a failure, will also walk back and try to shrink inputs (and still trigger the same failure).

Not sure how effective it is compared to other options, but I'm not totally missing out there.

It also builds a corpus of interesting inputs over time (those that cause new branches to be taken, since that's its goal: improve coverage).

I'm less sure about combinators.

gavinhoward
AFL++, a fuzzer, has a tool to minimize (shrink) test cases.

zarathustreal
“Properties” in the Property-based Testing sense refers to mathematical properties such as equality, associativity, commutativity, etc.
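
For example, stated as executable Go predicates (minInt is just a stand-in binary operation to make the laws concrete):

    package props

    // minInt is a stand-in binary operation.
    func minInt(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    func Commutative(a, b int) bool    { return minInt(a, b) == minInt(b, a) }
    func Associative(a, b, c int) bool { return minInt(minInt(a, b), c) == minInt(a, minInt(b, c)) }
    func Idempotent(a int) bool        { return minInt(a, a) == a }

A property-based test then asserts these hold for all generated inputs.
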
mrkeen
I'm not sure how Go's fuzz tests differ from what you linked in your article, but that article said proper fuzzers need to run for days or weeks, and that PBT should almost always be chosen over fuzz testing.

But I'd take one step back, and ask a more meta question about testing: does a successful test mean successful code, and vice-versa? Is there anything in Go's contract that specifies that the same inputs to the same code will yield the same output?

randomdata
> does a successful test mean successful code

A successful test implies that the documentation is true. That is, after all, why you are writing tests: Documentation – To explain to future developers the intent and usage behind the code you wrote. You could just as well write the documentation in Word instead, but then you'd lose the ability to perform machine verification of the documentation, which is a huge boon to moving codebases where the documentation and code are likely to fall out of sync if one is not incredibly careful (although perhaps LLMs will some day bridge that gap?).

The documentation being true does not imply that the documentation is useful, of course. Writing useful documentation is still an art.

tennis_80
This is a reason I’m very keen on making sure tests are focused on requirements rather than code.

A few jobs ago I would often be in a team where the entire team had turned over several times; we would be asked to do large updates to a legacy application or bump lots of dependencies and just not break anything. When pushed, the product owner wouldn’t be able to describe what the app was supposed to do, particularly for unusual types of users (the account/user modelling was chaotic: say, several billing accounts per user, each with different products and access levels). At that point "foo calls bar" doesn’t clarify much intent.

skybrian
As far as the API is concerned, the main thing you get is a combinator library for generating whatever random data structures you want to use. Working with Arbitrary types (which represent sets of random objects) makes it easy to write reusable functions for generating your test inputs. A library like that could probably be used along with Go’s fuzzing framework fairly easily?
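
A sketch of what such a layer over Go's fuzzer might look like (Gen, Map, OneOf, and the rest below are hypothetical, not an existing library; each generator consumes bytes from the fuzzer-provided input to build a value):

    package gen

    // Bytes wraps the fuzzer-provided input.
    type Bytes struct{ data []byte }

    func (b *Bytes) next() byte {
        if len(b.data) == 0 {
            return 0
        }
        v := b.data[0]
        b.data = b.data[1:]
        return v
    }

    // Gen builds a value of type T from fuzzer-controlled bytes.
    type Gen[T any] func(*Bytes) T

    // Map transforms a generator's output.
    func Map[A, B any](g Gen[A], f func(A) B) Gen[B] {
        return func(b *Bytes) B { return f(g(b)) }
    }

    // OneOf picks one of the alternatives with a fuzzer-controlled byte
    // (assumes at least one alternative is given).
    func OneOf[T any](gs ...Gen[T]) Gen[T] {
        return func(b *Bytes) T { return gs[int(b.next())%len(gs)](b) }
    }

    // Byte and SliceOf are example primitives.
    func Byte() Gen[byte] { return func(b *Bytes) byte { return b.next() } }

    func SliceOf[T any](g Gen[T]) Gen[[]T] {
        return func(b *Bytes) []T {
            n := int(b.next()) % 8 // keep lengths small; mutation explores shapes
            out := make([]T, n)
            for i := range out {
                out[i] = g(b)
            }
            return out
        }
    }

Inside f.Fuzz(func(t *testing.T, data []byte) { ... }) you would wrap data in Bytes and run a generator, so coverage-guided mutation of the raw bytes effectively searches the space of generated values.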

I still think the usual combinators (map, filter, chain, and oneOf, to use fast-check’s names) can be a bit awkward, so I’m writing a new property testing library for JavaScript that hopefully works a bit nicer. (Experimental and not published yet.)

bananapub
fuzzing is clearly not a replacement for tests?
josephg
It kind of is. I often use randomised tests (fuzzing) for data structures and algorithm implementations. Throw a lot of asserts in and see if an hour of random input can find any edge cases I’ve missed. Usually the first time I run something like this it finds problems instantly. It’s very humbling.

I find when I do this I don’t need to write by hand anywhere near as many tests to get my software working well. I also usually turn any failures found by the fuzzer into standalone unit tests, to make any regressions easier to find later.

ncruces OP
I agree, it can be. My example above matches yours (fuzzing a data structure).

I coded some tests that ensure the data structure is useful; many of them test examples from papers describing the data structure, but those don't necessarily cover all the corner cases.

Then I fuzzed it. I used Go's fuzzer, which is geared towards parsers and stuff. It can generate a stream of bytes and use that for fuzzing. The data structure is a set/map. So I interpret the stream of bytes as commands to add/remove random elements from the set/map. After I add an element, contains needs to return true; after I remove one, contains needs to return false; if I put in a mapping, finding it must return what I just put there, etc. And at every step all the data structure invariants (that ensure logarithmic search, etc.) need to hold.

That was stupidly effective at finding a few bugs, all within seconds, all with sequences of fewer than a dozen operations. Then it stops. And you get 100% coverage.

I'm assuming that, apart from the ergonomics (where I kinda build my own state machine transitions out of a stream of bytes), the tooling actually seems more effective than property testing libraries.

Still curious to understand what I'm missing out.

chriswarbo
There's a fun property-testing library for Haskell called LazySmallcheck (and a more powerful version called LazySmallcheck2012) which is sort-of coverage-based. Since Haskell values are lazy, it uses exceptions as its test data: if a particular exception gets thrown, the test must have forced that part of the test data, so it replaces that part and tries again.

For example, let's say our test data is an assoc-list like `[(Int, Customer)]`, and there's a bug that's triggered by repeated keys. LazySmallcheck will run the test on the most general assoc-list there is, a _|_ value!

  myTest (exception A)  --> throws A
Since A was thrown, LSC will try again, making the list slightly more specific, which means trying both constructors `[]` and `:`:

  myTest []                         --> test passes
  myTest (exception A:exception B)  --> throws A
The input `[]` is completely specified, and it didn't lead to any counterexample, so LSC is finished with that branch. In the `:` branch, the general pair `exception A` caused an error, so LSC will make it slightly more specific and try again:

  myTest ((exception A, exception C):exception B)  --> throws A
Since `exception A` is an Int, LSC can't use laziness to keep things general; it has to try concrete values, so it proceeds like Smallcheck, trying the "smallest" Int values:

  myTest ((0, exception C):exception B)  --> throws B
Again, LSC will make the general list `exception B` more specific:

  myTest ((0, exception C):[])  --> passes
  myTest ((0, exception C):exception A:exception B)  --> throws A
Since the input `(0, exception C):[]` passed the test without triggering the exception, we know the second element of that pair must be irrelevant to the result. In other words, LSC has proved the property holds for inputs `(0, X):[]` for all X!

LSC hence won't try specialising `exception C`; but it will try "larger" Int values, e.g. `(1, exception C):[]`, `(-1, exception C):[]`, etc. up to a pre-defined "depth". I'll ignore these cases and focus on the branch with the longer list:

  myTest ((0, exception C):(exception A, exception D):exception B)  --> throws A
Again, LSC will specialise the general Int `exception A` into specific values, starting with the "smallest":

  myTest ((0, exception C):(0, exception D):exception B)  --> FAIL!
This input has duplicate keys, which triggered the bug and caused our test to fail. Again, the fact that the exceptions inside this value didn't get thrown means those parts were irrelevant to the result. LSC indicates irrelevant values with `_`, so it will report this counterexample as `(0, _):(0, _):_`.

That's more informative for understanding this bug than e.g. Smallcheck or QuickCheck, whose "minimal" counterexamples would be fully-specified values, like `(0, Customer { name = "", orders = [], discount = Nothing, joined = 1970-01-01 }):(0, Customer { name = "", orders = [], discount = Nothing, joined = 1970-01-01 }):[]`.

Hence LSC's approach has three interesting features:

- It has a coverage-like approach that homes-in on interesting cases, by gradually generating the parts of a value that get forced (which are usually the parts being branched on)

- It can indicate which parts of a counterexample are irrelevant for the failure

- It can prove some properties universally (i.e. when that input is irrelevant to the pass result)
