Re: Optimisations and undefined behaviour

On 10/11/15 11:16, Richard Earnshaw wrote:
> On 10/11/15 09:20, David Brown wrote:
>> On 09/11/15 17:31, Andrew Haley wrote:
>>> On 11/09/2015 03:56 PM, David Brown wrote:
>>>
>>>> We typically cannot use "sanitize" options, nor can we accept that a
>>>> bug in one part of the program causes undue and unnecessarily
>>>> damaging side-effects in other parts.
>>>
>>> Well, you have to get used to that.  It is reality: computers work
>>> that way.  I'm sure you know that if you hit the wrong I/O port
>>> with a wild write, odd things will happen.  Whether that's "undue" or
>>> "unnecessary" I couldn't say: it just is.
>>
>> There is no doubt that with C, it is easy to make mistakes that can lead
>> to disaster.  One little typo, and you access an array outside of its
>> bounds - with the result of corrupting a completely independent piece of
>> data, showing up as a problem in a different part of the program that
>> was already tested and working fine.  These things are part of the risk
>> of C programming - it's the cost you pay for maximum speed and efficiency.
>>
>> But let's look back at Annex L in the C11 standards, and ask what it is
>> /for/.  It is labelled "Analyzability" - the aim is to help developers
>> (and code reviewers, and static analysers) figure out if code is correct
>> or not.  Part of that is to place bounds on the effects of some types of
>> undefined behaviour.  Now, you can certainly argue that Annex L doesn't
>> go far enough to be useful, or it is too vague, or that it is
>> impractical or limiting to implement, or that knock-on effects can
>> quickly turn bounded undefined behaviour into unbounded undefined
>> behaviour.  But the very existence of this Annex in the C standards
>> shows how important these things are.
>>
>> To my mind, if execution hits UB, then it's a bug in the program.  It
>> doesn't really matter if the bug is UB, or an error in the algorithm, or
>> a coding mistake, or even a mistake in the customer's requirement
>> specifications.  It's a point where the computer does not do what the
>> user expects it to do.  Is it so unreasonable to want code that is
>> correct up to that bug, to run correctly until the bug is hit?  Is it
>> unreasonable to want the compiler to warn if it sees such a bug and
>> uses it to change the past actions of the code?
>>
>> I know that once code hits the bug, all sorts of things can go wrong.
>> All I ask is that the compiler should not make things worse - at least
>> not without informing the developer.
>>
>>
>>>
>>> C definitely works that way.  Maybe there should be a nice small
>>> language which is useful for embedded developers and doesn't have
>>> all the interesting UB properties that C has.  (Ada, maybe?  Probably
>>> not.)  Maybe you could define a language compatible with C with the UB
>>> removed.  But defining the semantics of such a language would not be
>>> easy.  And I don't think it makes much sense to change GCC without
>>> such a rigorous language definition.
>>>
>>
>> I don't believe it is possible to make a general and efficient
>> programming language without UB.  And even if it could be eliminated at
>> the language level, it would still exist at a higher level - a square
>> root function could well have undefined behaviour for negative inputs.
>> I also am happy that gcc can make many small optimisations based on its
>> knowledge of UB - strength reduction and constant folding in integer
>> expressions is a key example.
>>
>> But I can also see the potential for optimisations based on UB to make
>> developers' lives much more difficult by working logically backwards in
>> time.
>>
> 
> You have to remember that a lot (but by no means all) of UB in C comes
> from the desire to have an efficient language that is portable to many
> architectures.  C has to support 'int' of unspecified maximum size,
> machines that have two's complement, one's complement or sign-magnitude
> arithmetic, and multiple different floating point formats.  The way the
> language committee has enabled such support without requiring the tools
> to produce grossly inefficient code is to say that if you step outside
> the limits of the machine you are running on, then the behaviour is
> undefined.  However, it didn't specify exactly what had to happen
> in those cases, and didn't distinguish them from other cases where UB
> can occur; it just said: if this happens, your program is broken.
> Furthermore, it said quite clearly that compilers do not HAVE to detect
> such brokenness - because in the extreme cases that would be very
> expensive or even impossible to prove.

I understand that - and I agree with the principle behind the use of UB
in the C standards.  (I think perhaps support for some of the outdated
hardware, such as anything other than two's complement signed
integers, could be deprecated in newer C standards - but that's another
issue.)

However, the fact that the C standards make certain behaviour undefined
does not mean that C /compilers/ must leave it that way.  In particular,
some undefined behaviour situations could usefully be turned into
unspecified behaviour while retaining much of the optimisation
flexibility, but with less of the danger.  I think Annex L could be
described in that way.  Some types of undefined behaviour, such as
accessing arrays out of bounds, cannot reasonably be bounded or limited
in any way - you may be writing over some other data, and the results
are completely unpredictable.
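
As a trivial illustration (the names here are invented purely for the
example), consider:

    int buf[4];
    int limit = 100;

    void fill(void)
    {
        int i;
        for (i = 0; i <= 4; i++)   /* off-by-one: also writes buf[4] */
            buf[i] = 0;
    }

The stray write to buf[4] might clobber "limit", some other variable,
or nothing that matters at all, depending entirely on how the compiler
and linker happen to lay out memory - so there is no sensible bound an
implementation could promise for it.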

But consider if integer overflow were made "unspecified" rather than
"undefined" - i.e., it were guaranteed to return a valid integer, but
you know nothing about /which/ integer it is.  For efficiency, that
would require that the hardware does not have enforced traps on
overflow.  But the compiler is still able to optimise "x * 3 / 6" to "x
/ 2", and still able to assume that "x + 1 > x".  But given "x = 1290;
foo(); y = x * x * x;" it is no longer able to eliminate the call to "foo".

(It is quite possible that my analysis above only applies in some
circumstances, not others - I haven't attempted to think through a range
of cases, but merely to give an example of how it is possible to
generate efficient code without surprising the programmer /too/ much.)
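
To make that example concrete, here is a rough sketch of the three
cases in C.  "foo" is just a placeholder for an external function the
compiler cannot see into, and I am assuming a 16-bit "int" (as on many
of the embedded targets I work with) so that the cube genuinely
overflows:

    extern void foo(void);

    int scale(int x)
    {
        /* Any overflow in x * 3 yields an unspecified value, which the
           implementation may choose to be the mathematical result, so
           rewriting this as x / 2 remains a legal optimisation. */
        return x * 3 / 6;
    }

    int monotonic(int x)
    {
        /* Likewise, the unspecified result of x + 1 at INT_MAX may be
           chosen to compare greater than x, so the compiler may still
           fold this expression to 1. */
        return x + 1 > x;
    }

    int cube(void)
    {
        int x = 1290;
        foo();             /* with "unspecified" overflow, this call must
                              survive; with full UB a compiler could in
                              principle delete it, since the overflow
                              below makes the whole execution undefined */
        return x * x * x;  /* overflows a 16-bit int */
    }

(Again, this is only my sketch of the intended semantics, not a
description of what gcc actually does today.)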

> 
> (I would add at this point, that it is also very hard at times to prove
> something would invoke UB, so the ability to emit a diagnostic could
> depend on previous optimizations done on the code before the check is
> performed -- you only have to look at the history of
> -Wmaybe-uninitialized to realize how hard it is to get things like this
> right; and false positives can be as painful for developers as missing
> warnings or warnings that seem to appear and disappear at random).

I appreciate that this is very hard, and that striking a useful balance
between helpful warnings and few false positives is a difficult process.
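
A classic illustration of why this is hard ("compute" is just a
hypothetical helper here):

    extern int compute(void);

    int get(int have_value)
    {
        int x;
        if (have_value)
            x = compute();
        /* ... unrelated work ... */
        if (have_value)
            return x;      /* only read when it was set above */
        return -1;
    }

Whether a compiler warns about "x" here can legitimately depend on how
much of the correlation between the two tests it has managed to prove,
which in turn depends on what earlier optimisation passes have done -
exactly the kind of appearing-and-disappearing warning Richard
describes.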

> 
>> (Note that in all my testing, I have not found gcc to perform any of
>> these unwanted "optimisations" that I fear - but I haven't found good
>> reason to be sure that they won't be performed in the future.)
>>
> 
> And I doubt we would ever make such a guarantee.  It might severely
> hamper GCC's stated goal of being a production-quality optimizing compiler.

gcc is already a highly optimising compiler - if it is not "production
quality", then I don't know of any compiler that /is/ "production
quality".  And I support the goal of making gcc ever better.

But the goal of efficient code generation must not be the only one, and
gcc must not lose sight of other important goals.  For the programmer,
writing /correct/ code must always outweigh /fast/ code - and the
compiler should be a tool to aid the programmer.

> 
>>
>> And yes, I realise that taken to extremes, this is ultimately an
>> impossible task.  The point is merely to aim in that direction.  I don't
>> expect gcc to warn on every possible error - but I am happy to see it
>> warning on more and more possible errors.  What I don't want to see
>> is the attitude that, since it is impossible to know /exactly/ what the
>> programmer wanted when there is UB in the code, the compiler can refuse
>> all responsibility for trying.  That would IMHO be a disservice to the
>> user, making it harder to find and fix bugs.
> 




