Cheating With Asserts

Fail fast (or fail early) is a concept in programming that encourages programmers to detect errors and unexpected conditions as soon as possible and to arrange for the program to stop working immediately when they occur. This concept is best described in Wikipedia and in the article by Jim Shore, and it has also been discussed on forums like Stack Overflow.

The idea is to prevent the program from continuing to do bad things, like producing incorrect results - the assumption here is that it is better for the program not to run at all than to work in a wrong way. In most cases this is actually true. The other benefit of failing fast is that a program that crashes violently is more likely to get the attention of someone responsible for it (not necessarily a human being, it might be some other, supervising program) who might be in a position to handle the problem.

In short, whenever some place in the program relies on a condition, and the program is likely to produce incorrect results further on if that condition is false, it is enough to write:

assert(we_hope_that_something_is_true_here);

(or some equivalent in any given programming language)

Now, if the crucial condition does not hold, the program will crash violently instead of making things even worse further on.
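As a minimal sketch of this pattern in C, the hypothetical function below guards the conditions that its result depends on. The name mean and its parameters are illustrative only and are not taken from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of `count` samples. */
double mean(const double *samples, size_t count)
{
    /* Fail fast: without these conditions the division below
       would produce a meaningless result. */
    assert(samples != NULL);
    assert(count > 0);

    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        sum += samples[i];
    }
    return sum / (double)count;
}
```

Calling mean(samples, 0) aborts the program at the failed assertion instead of returning a meaningless value. Note that the standard assert macro expands to nothing when NDEBUG is defined, so in such builds the check disappears entirely.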

This, in principle, is all true.

The problem

There are two problems, actually.

First, this reasoning implicitly assumes that crashing is a valid program response that might be even part of its design. This can be justified in some interactive programs (including GUI ones) or in batch processors where restarting the operation is not only meaningful, but also harmless for their user.

But not all systems belong to this category. Crashing is certainly not acceptable in real-time systems that operate under strong timing constraints, as there might not be enough time to restart them.

Also, crashing is not acceptable in all those systems where the supervising module (the one that presumably has the ability to recognize the crash and to handle the error) and the user/client are distinct, which is, well, most of the cases. Consider a user who connects to his bank in order to do some operations when the server program crashes. That server program might be controlled by some supervising module that will notice the problem and will "handle" it (by, let's guess, writing a line in a log file?), but the user has no visibility into what has happened and his error handling is therefore ad hoc. Restarting the server program can be very easy, but repeating all banking operations that were made during that long evening session might be a very bad idea. So - what does it mean "to handle the error" then?

Such examples show that failure is in the eye of the beholder - and therefore "failing fast" will have a different value (or price) for each party that is involved.

But the real problem that this article is all about is not technical - it is cultural.

Failure as an acceptable outcome

The whole idea of failing fast has a very interesting side effect: it encourages programmers to consider failure as a valid outcome. Failure is no longer considered to be something that programmers should be afraid of and should avoid. On the contrary - failing is encouraged, even to the point where it becomes a sign of the programmer staying in control. That's right, there is no mistake here.

This problem can be described with a very simple example: memory management. Consider:

char * buffer = (char *)malloc(buffer_size);

The code above allocates a buffer of some size and expects that the allocation was successful. Otherwise it might not make any sense to continue the work. Some programmers will say that the allocation failure is unlikely and ignore it altogether, but others will feel more responsible for it and decide that the possibility of allocation failure should be handled - somehow.

Depending on the actual program there can be many different strategies to react to memory allocation failures. Some of the possibilities include releasing less critical buffers like data caches and retrying the allocation or perhaps attempting some other way to achieve the goal like a different computing algorithm that is less resource intensive, etc.
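The first of these strategies could be sketched as follows. This is only an illustration under stated assumptions: the cache structure, the alloc_with_fallback function and the flaky_alloc test allocator are hypothetical names invented for this example, and the allocator is passed in as a parameter so that the failure path can be exercised deliberately.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache that the program can rebuild later if needed. */
struct cache {
    void  *data;
    size_t size;
};

/* Drop the cache contents so the memory goes back to the allocator. */
void cache_release(struct cache *c)
{
    free(c->data);
    c->data = NULL;
    c->size = 0;
}

/* Allocator signature; injectable so the failure path can be tested. */
typedef void *(*alloc_fn)(size_t);

/*
 * Try to allocate `size` bytes. If that fails and the cache holds
 * memory, release the cache and retry once. Returns NULL only when
 * the retry fails too (or there was nothing to release), which leaves
 * the final decision with the caller.
 */
void *alloc_with_fallback(alloc_fn alloc, struct cache *c, size_t size)
{
    void *p = alloc(size);
    if (p == NULL && c->data != NULL) {
        cache_release(c);
        p = alloc(size);
    }
    return p;
}

/* Test allocator: fails the next `flaky_failures_left` calls,
   then defers to malloc. */
int flaky_failures_left = 0;

void *flaky_alloc(size_t size)
{
    if (flaky_failures_left > 0) {
        --flaky_failures_left;
        return NULL;
    }
    return malloc(size);
}
```

In production code the call would be alloc_with_fallback(malloc, &cache, size); with flaky_alloc and flaky_failures_left set to 1, the same call walks through the release-and-retry branch.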
But implementing such alternative strategies is challenging and... not very rewarding - after all, this is an unlikely event and, in addition to the lack of business incentive, it's not always clear how this unlikely program path could ever be tested.
All this sounds like an unnecessary burden, so let's go the easy way:

char * buffer = (char *)malloc(buffer_size);
assert(buffer != NULL);

// HURRAY!
// ...

That's it. Problem solved, let's move on to some more exciting tasks.

Other examples that follow the same pattern include creating system-level resources like a thread, mutex, socket, etc. These things usually work, and even though the system functions that have to be used here all anticipate the possibility of failure by defining appropriate error codes, programmers are so used to successful outcomes that they stop taking the failure into consideration. And this is when they come up with easy solutions like assert, which seems to "handle" the error in a single line of code.

Wait!

There is a very important phrase in the paragraph above that is worth restating:

Programmers turn to easy solutions like assert when they don't really take the failure into consideration.

Or maybe even:

Programmers turn to easy solutions like assert in order to hide problems that they do not want to or are just unable to handle in a proper way.

And of course, the whole discussion is not limited to asserts - replacing them with something like this (in a language that supports exceptions):

if (some_condition == false) {
    throw new ForgetAboutItException();
}

has exactly the same underlying problem - it actually avoids doing proper error handling while allowing the programmer to believe that he has done his job correctly.

Is this what fail fast was about? Not likely - this is only what we, the programmers, have turned it into. And it is certainly not how it should be.

I have seen asserts in places that really deserve more mature handling and I openly admit to committing this kind of sin myself. But I think that a light-hearted approach like this is one of the factors that contribute to the steady degradation of software engineering culture as a whole. Let's stop this collective cheating and pretending that we are in control when the only thing that we do is sweep problems under the carpet.

How to do better

Having said that, it makes sense to point to examples where the possibility of failure is really taken into account, as it should be.

Good sources of such examples are those programming languages that literally force the programmer to deal with every possible outcome of every single operation and check that statically. The SPARK programming language, which is based on Ada, is one such example: pretending that some condition is so unlikely that it "never happens" and can therefore be ignored is a compile-time error, and unless the programmer makes an effort to deal with every possible outcome, the program will simply not compile at all.
Consider these two engineering approaches:

It is better to crash at run-time than to continue with incorrect results.
It is better not to compile at all than to be incorrect.

Hopefully the difference between these two is well understood - even though the first statement is accepted as true (but remember that failure is in the eye of the beholder, as explained earlier), the second is much stronger and reflects much deeper engineering insight. Still, the programming community seems to be satisfied with the cheap and easy path.

Switching to another programming language might be unacceptable for many teams, but even with the mainstream technologies it is possible to do a lot better with careful selection of language subsets and a good dose of coding discipline guided by proven coding standards - MISRA-C is an example of such a standard for C programmers, although such standards do not seem to have enough visibility and are not widely adopted within the open-source community.
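To give a flavour of what such discipline can look like in plain C, here is a small sketch of a fixed-capacity queue with no dynamic allocation and no asserts, where every operation reports its outcome through a return value and the caller decides what to do with it. All names (queue, queue_status, queue_push_latest and so on) are hypothetical and made up for this illustration; the code is not taken from any existing library and is not claimed to conform to any particular coding standard.

```c
#include <stddef.h>

#define QUEUE_CAPACITY 4u

typedef enum {
    QUEUE_OK,
    QUEUE_FULL,
    QUEUE_EMPTY,
    QUEUE_BAD_ARGUMENT
} queue_status;

typedef struct {
    int    items[QUEUE_CAPACITY];
    size_t head;   /* index of the oldest item */
    size_t count;  /* number of stored items */
} queue;

queue_status queue_init(queue *q)
{
    if (q == NULL) {
        return QUEUE_BAD_ARGUMENT;
    }
    q->head = 0u;
    q->count = 0u;
    return QUEUE_OK;
}

queue_status queue_push(queue *q, int item)
{
    if (q == NULL) {
        return QUEUE_BAD_ARGUMENT;
    }
    if (q->count == QUEUE_CAPACITY) {
        return QUEUE_FULL;
    }
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    return QUEUE_OK;
}

queue_status queue_pop(queue *q, int *item)
{
    if (q == NULL || item == NULL) {
        return QUEUE_BAD_ARGUMENT;
    }
    if (q->count == 0u) {
        return QUEUE_EMPTY;
    }
    *item = q->items[q->head];
    q->head = (q->head + 1u) % QUEUE_CAPACITY;
    q->count--;
    return QUEUE_OK;
}

/* Caller-side policy: when the queue is full, drop the oldest item
   and store the new one. Every status is checked and passed on. */
queue_status queue_push_latest(queue *q, int item)
{
    queue_status st = queue_push(q, item);
    if (st == QUEUE_FULL) {
        int dropped;
        st = queue_pop(q, &dropped);
        if (st == QUEUE_OK) {
            st = queue_push(q, item);
        }
    }
    return st;
}
```

The point of the sketch is that a full queue is treated as an ordinary outcome with an explicit policy (here: drop the oldest item), not as a reason to stop the program. Whether dropping data is the right policy depends entirely on the system in question.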

How could this state of affairs be improved? It is not clear what will work in the long run, but at least we can do our own homework by offering products that raise the bar in this area. Check the YAMI4 MISRA-C package, which is a messaging library written in C for those distributed systems where cheating with asserts is not acceptable. Our claim is that this code properly deals with all conditions, without any cheating - feel free to inspect the code and share your feedback.