Re: Bug in std::floor?

Mason <slash.tmp@xxxxxxx> · Mon, 13 Nov 2017 11:38:35 +0100

On 09/11/2017 16:46, Vincent Lefevre wrote:

> On 2017-11-09 10:01:45 +0100, Mason wrote:
>
>>> On 08/11/17 23:13 +0200, Nil Geisweiller wrote:
>>>
>>>> The following
>>>>
>>>> ---------------------------------------------------------------------
>>>> #include <iostream>
>>>> #include <cmath>
>>>>
>>>> int main()
>>>> {
>>>>    double v = 4.6;
>>>>    std::cout << "v = " << v << std::endl;
>>>>    std::cout << "v*100 = " << v*100 << std::endl;
>>>>    std::cout << "floor(v*100) = " << std::floor(v*100) << std::endl;
>>>> }
>>>> ---------------------------------------------------------------------
>>>>
>>>> outputs
>>>>
>>>> ------------------
>>>> v = 4.6
>>>> v*100 = 460
>>>> floor(v*100) = 459
>>>> ------------------
>>>>
>>>> It that a bug?
>>
>> There is a boring and excruciatingly long paper floating (ha!) around:
>> "What Every Computer Scientist Should Know About Floating-Point Arithmetic"
>> https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
>>
>> This other site tries to be more accessible:
>> http://floating-point-gui.de/
>>
>> The short version is that it is not possible to represent 0.1
>> (or any multiple thereof) in (fractional) base 2.
> 
> Actually, that's mainly a language design bug, which doesn't show
> the error for 460 (which is representable exactly).

It's not clear to me exactly /what/ you are calling a language design bug?
And in what language? C? IEEE 754?
(AFAIK, C is pretty loose with the floating point spec.)

>> Basically, using "IEEE 754 double-precision binary floating-point format"
>> 4.6 is approximated as D=4.5999999999999996447286321199499070644378662109375
>> ( https://en.wikipedia.org/wiki/Double-precision_floating-point_format )
>>
>> Thus, it is obvious why floor(D*100) equals 459.
> 
> No, this is not obvious. The multiplication by 100 introduces a
> rounding error. Thus this doesn't prove that the rounded value
> D*100 is strictly less than 460.

Let's take 1.7 for example.

In IEEE double precision, it is represented as 3ff'b333333333333,
i.e. 1 + 0.5 + sum(n=1..13, 3/16^n) = D0

1.7 * 2 = 400'b333333333333 = D1
1.7 * 8 = 402'b333333333333 = D3

D0 = 0x1.b333333333333p0;
D1 = 0x1.b333333333333p1;
D3 = 0x1.b333333333333p3;

Scaling D1 to p3
D1 = 0x0.6ccccccccccccp3

0x1.b333333333333p3
0x0.6ccccccccccccp3
-------------------
0x2.1ffffffffffffp3

0x2.1ffffffffffffp3 = 0x1.0fffffffffffffp4
which would be rounded up to 0x1.1p4 = 17 (exact representation)

OK so the error between 17 and D1+D3 is too small to represent
in a double, thus the sum is rounded up. However, if D1+D3 is
stored in a longer intermediate type, such as x87 long double,
then the error can be represented.

Floating point is full of unintuitive surprises.

$ cat fp.bc
scale=60
sum=0
for (n = 1; n <= 13; n = n + 1) sum = sum + 3 / 16^n
res = 1 + 0.5 + sum
res
res * 10
17 - res * 10

$ bc < fp.bc 
 1.699999999999999955591079014993738383054733276367187500000000
16.999999999999999555910790149937383830547332763671875000000000
 0.000000000000000444089209850062616169452667236328125000000000

$ cat fp.c 
#include <stdio.h>
volatile double d = 1.7;
int main(void)
{
	long double ld = (long double)d * 10;
	printf("%.60f %016llx\n", d, *(unsigned long long *)&d);
	d = d * 10;
	printf("%.60f %016llx\n", d, *(unsigned long long *)&d);
	printf("%.60Lf\n", ld);
	return 0;
}

$ gcc-7 -O2 -ffloat-store fp.c && ./a.out 
 1.699999999999999955591079014993738383054733276367187500000000 3ffb333333333333
17.000000000000000000000000000000000000000000000000000000000000 4031000000000000
16.999999999999999555910790149937383830547332763671875000000000

Regards.