On 10/11/15 10:44, David Brown wrote:
> On 10/11/15 11:16, Richard Earnshaw wrote:
>> On 10/11/15 09:20, David Brown wrote:
>>> On 09/11/15 17:31, Andrew Haley wrote:
>>>> On 11/09/2015 03:56 PM, David Brown wrote:
>>>>
>>>>> We typically cannot use "sanitize" options, nor can we accept that
>>>>> a bug in one part of the program causes undue and unnecessarily
>>>>> damaging side-effects in other parts.
>>>>
>>>> Well, you have to get used to that.  It is reality: computers work
>>>> that way.  I'm sure you know that if you hit the wrong I/O port
>>>> with a wild write, odd things will happen.  Whether that's "undue"
>>>> or "unnecessary" I couldn't say: it just is.
>>>
>>> There is no doubt that with C, it is easy to make mistakes that can
>>> lead to disaster.  One little typo, and you access an array outside
>>> its bounds - corrupting a completely independent piece of data, and
>>> showing up as a problem in a different part of the program that was
>>> already tested and working fine.  These things are part of the risk
>>> of C programming - the cost you pay for maximum speed and
>>> efficiency.
>>>
>>> But let's look back at Annex L in the C11 standard, and ask what it
>>> is /for/.  It is labelled "Analyzability" - the aim is to help
>>> developers (and code reviewers, and static analysers) figure out
>>> whether code is correct.  Part of that is to place bounds on the
>>> effects of some types of undefined behaviour.  Now, you can
>>> certainly argue that Annex L doesn't go far enough to be useful,
>>> that it is too vague, that it is impractical or limiting to
>>> implement, or that knock-on effects can quickly turn bounded
>>> undefined behaviour into unbounded undefined behaviour.  But the
>>> very existence of this annex in the C standard shows how important
>>> these things are.
>>>
>>> To my mind, if execution hits UB, then it's a bug in the program.
>>> It doesn't really matter whether the bug is UB, an error in the
>>> algorithm, a coding mistake, or even a mistake in the customer's
>>> requirement specification.  It is a point where the computer does
>>> not do what the user expects it to do.  Is it so unreasonable to
>>> want code that is correct up to that bug to run correctly until the
>>> bug is hit?  Is it unreasonable to want the compiler to warn if it
>>> sees such a bug and uses it to change the earlier actions of the
>>> code?
>>>
>>> I know that once code hits the bug, all sorts of things can go
>>> wrong.  All I ask is that the compiler should not make things worse
>>> - at least not without informing the developer.
>>>
>>>> C definitely works that way.  Maybe there should be a nice small
>>>> language which is useful for embedded developers and doesn't have
>>>> all the interesting UB properties that C has.  (Ada, maybe?
>>>> Probably not.)  Maybe you could define a language compatible with
>>>> C with the UB removed.  But defining the semantics of such a
>>>> language would not be easy.  And I don't think it makes much sense
>>>> to change GCC without such a rigorous language definition.
>>>
>>> I don't believe it is possible to make a general and efficient
>>> programming language without UB.  And even if UB could be
>>> eliminated at the language level, it would still exist at a higher
>>> level - a square root function could well have undefined behaviour
>>> for negative inputs.
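>>>
>>> As a purely illustrative sketch of that kind of library-level
>>> contract (hypothetical code, not from any real library):
>>>
>>>     /* Precondition: x >= 0.  The function does not check, so for
>>>        negative inputs the result is meaningless garbage "by
>>>        contract" - the same shape as a language-level UB rule, one
>>>        level up. */
>>>     double my_sqrt(double x)
>>>     {
>>>         double r = (x > 1.0) ? x : 1.0;  /* guess >= sqrt(x) */
>>>         for (int i = 0; i < 32; i++)
>>>             r = 0.5 * (r + x / r);       /* Newton-Raphson step */
>>>         return r;
>>>     }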
>>>
>>> I am also happy that gcc can make many small optimisations based on
>>> its knowledge of UB - strength reduction and constant folding in
>>> integer expressions are key examples.
>>>
>>> But I can also see the potential for optimisations based on UB to
>>> make developers' lives much more difficult by working logically
>>> backwards in time.
>>
>> You have to remember that a lot (but by no means all) of the UB in C
>> comes from the desire for an efficient language that is portable to
>> many architectures.  C has to support 'int' of unspecified maximum
>> size, machines with two's complement, one's complement or
>> sign-magnitude arithmetic, and multiple different floating point
>> formats.  The way the language committee enabled such support,
>> without requiring the tools to produce grossly inefficient code, was
>> to say that if you step outside the limits of the machine you are
>> running on, the behaviour is undefined.  However, it didn't
>> specifically qualify what had to happen in those cases, and didn't
>> distinguish them from other cases where UB can occur; it just said:
>> if this happens, your program is broken.  Furthermore, it said quite
>> clearly that compilers do not HAVE to detect such brokenness -
>> because in the extreme cases that would be very expensive, or even
>> impossible, to prove.
>
> I understand that - and I agree with the principle behind the use of
> UB in the C standards.  (I think support for some of the outdated
> hardware, such as anything other than two's complement signed
> integers, could be deprecated in newer C standards - but that's
> another issue.)
>
> However, just because the C standards make certain behaviour
> undefined does not mean that C /compilers/ must leave it that way.
> In particular, some undefined behaviour could usefully be turned into
> unspecified behaviour, retaining much of the optimisation flexibility
> but with less of the danger.  I think Annex L could be described that
> way.  Some types of undefined behaviour, such as accessing arrays out
> of bounds, cannot reasonably be bounded or limited in any way - you
> may be writing over some other data, and the results are completely
> unpredictable.
>
> But consider if integer overflow were made "unspecified" rather than
> "undefined" - i.e., it were guaranteed to return a valid integer, but
> you would know nothing about /which/ integer.  For efficiency, that
> would require hardware without enforced traps on overflow.  The
> compiler would still be able to optimise "x * 3 / 6" to "x / 2", and
> still be able to assume that "x + 1 > x".  But given "x = 1290;
> foo(); y = x * x * x;" it would no longer be able to eliminate the
> call to "foo".

I /think/ you've just said that x86 can no longer use its divide
instruction.  That instruction faults on n/0 and on INT_MIN/-1.

R.

> (It is quite possible that my analysis above only applies in some
> circumstances, not others - I haven't attempted to think through a
> range of cases, but merely to give an example of how it is possible
> to generate efficient code without surprising the programmer /too/
> much.)
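>
> To make the "backwards in time" worry concrete, here is the sort of
> thing I have in mind - a sketch only, assuming a 32-bit int and a
> value that genuinely overflows, and without my having checked what
> any particular gcc version actually does with it:
>
>     extern void foo(void);      /* opaque; has side effects */
>     extern int y;
>
>     int wraps(int x)
>     {
>         return x + 1 > x;       /* UB overflow lets this fold to 1 */
>     }
>
>     void bar(void)
>     {
>         int x = 100000;
>         foo();
>         y = x * x;              /* 100000 * 100000 overflows: UB */
>     }
>
> Under "undefined" rules, a compiler that can also prove foo() returns
> normally may treat bar() as never validly reached, and in principle
> drop the call to foo() - the bug reaches backwards past code that was
> correct.  Under "unspecified" rules, y would simply receive an
> unknown value, and foo() would still have to be called.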
>
>> (I would add at this point that it is also very hard at times to
>> prove that something would invoke UB, so the ability to emit a
>> diagnostic can depend on previous optimizations done on the code
>> before the check is performed -- you only have to look at the
>> history of -Wmaybe-uninitialized to realize how hard it is to get
>> things like this right; and false positives can be as painful for
>> developers as missing warnings, or warnings that seem to appear and
>> disappear at random.)
>
> I appreciate that this is very hard, and that achieving a useful
> balance of warnings with few false positives is often a difficult
> process.
>
>>> (Note that in all my testing, I have not found gcc to perform any
>>> of these unwanted "optimisations" that I fear - but I haven't found
>>> good reason to be sure that they won't be performed in the future.)
>>
>> And I doubt we would ever make such a guarantee.  It might severely
>> hamper GCC's stated goal of being a production-quality optimizing
>> compiler.
>
> gcc is already a highly optimising compiler - if it is not
> "production quality", then I don't know any compiler that /is/
> "production quality".  And I support the goal of making gcc ever
> better.
>
> But the goal of efficient code generation must not be the only one,
> and gcc must not lose sight of other important goals.  For the
> programmer, writing /correct/ code must always outweigh writing
> /fast/ code - and the compiler should be a tool that aids the
> programmer.
>
>>> And yes, I realise that taken to extremes this is ultimately an
>>> impossible task.  The point is merely to aim in that direction.  I
>>> don't expect gcc to warn on every possible error - but I am happy
>>> to see it warning on more and more possible errors.  What I don't
>>> want to see is the attitude that, since it is impossible to know
>>> /exactly/ what the programmer wanted when there is UB in the code,
>>> the compiler can refuse all responsibility for trying.  That would
>>> IMHO be a disservice to the user, making it harder to find and fix
>>> bugs.
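P.S. The classic shape that makes -Wmaybe-uninitialized so sensitive
to pass ordering is something like this (illustrative only - whether a
given gcc release warns here depends on how much of the correlation
between the two tests the optimizers have worked out before the
uninitialized-use check runs):

    int f(int flag)
    {
        int x;              /* only set when flag is non-zero */
        if (flag)
            x = 1;
        /* ... intervening code ... */
        if (flag)
            return x;       /* never actually uninitialized, but
                               proving that needs the correlation
                               with 'flag' */
        return 0;
    }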