Re: Kernel oops caused by signed divide

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 9/10/24 2:53 PM, Alexei Starovoitov wrote:
On Tue, Sep 10, 2024 at 12:32 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:

On 9/10/24 11:25 AM, Alexei Starovoitov wrote:
On Tue, Sep 10, 2024 at 11:02 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 9/10/24 8:21 AM, Alexei Starovoitov wrote:
On Tue, Sep 10, 2024 at 7:21 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 9/9/24 10:29 AM, Alexei Starovoitov wrote:
On Mon, Sep 9, 2024 at 10:21 AM Zac Ecob <zacecob@xxxxxxxxxxxxxx> wrote:
Hello,

I recently received a kernel 'oops' about a divide error.
After some research, it seems that the 'div64_s64' function used for the 'MOD'/'REM' instructions boils down to an 'idiv'.

The 'dividend' is set to INT64_MIN, and the 'divisor' to -1, then because of two's complement, there is no corresponding positive value, causing the error (at least to my understanding).


Apologies if this is already known / not a relevant concern.
Thanks for the report. This is a new issue.

Yonghong,

it's related to the new signed div insn.
It sounds like we need to update chk_and_div[] part of
the verifier to account for signed div differently.
In verifier, we have
      /* [R,W]x div 0 -> 0 */
      /* [R,W]x mod 0 -> [R,W]x */
the verifier is doing what hw does. In this case this is arm64 behavior.
Okay, I see. I tried on a arm64 machine it indeed hehaves like the above.

# uname -a
Linux ... #1 SMP PREEMPT_DYNAMIC Thu Aug  1 06:58:32 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux
# cat t2.c
#include <stdio.h>
#include <limits.h>
int main(void) {
     volatile long long a = 5;
     volatile long long b = 0;
     printf("a/b = %lld\n", a/b);
     return 0;
}
# cat t3.c
#include <stdio.h>
#include <limits.h>
int main(void) {
     volatile long long a = 5;
     volatile long long b = 0;
     printf("a%%b = %lld\n", a%b);
     return 0;
}
# gcc -O2 t2.c && ./a.out
a/b = 0
# gcc -O2 t3.c && ./a.out
a%b = 5

on arm64, clang18 compiled binary has the same result

# clang -O2 t2.c && ./a.out
a/b = 0
# clang -O2 t3.c && ./a.out
a%b = 5

The same source code, compiled on x86_64 with -O2 as well,
it generates:
     Floating point exception (core dumped)

What the value for
      Rx_a sdiv Rx_b -> ?
where Rx_a = INT64_MIN and Rx_b = -1?
Why does it matter what Rx_a contains ?
It does matter. See below:

on arm64:

# cat t1.c
#include <stdio.h>
#include <limits.h>
int main(void) {
     volatile long long a = LLONG_MIN;
     volatile long long b = -1;
     printf("a/b = %lld\n", a/b);
     return 0;
}
# clang -O2 t1.c && ./a.out
a/b = -9223372036854775808
# gcc -O2 t1.c && ./a.out
a/b = -9223372036854775808

So the result of a/b is LLONG_MIN

The same code will cause exception on x86_64:

$ uname -a
Linux ... #1 SMP Wed Jun  5 06:21:21 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
[yhs@devvm1513.prn0 ~]$ gcc -O2 t1.c && ./a.out
Floating point exception (core dumped)
[yhs@devvm1513.prn0 ~]$ clang -O2 t1.c && ./a.out
Floating point exception (core dumped)

So this is what we care about.

So I guess we can follow arm64 result too.

What cpus do in this case?
See above. arm64 produces *some* result while x64 cause exception.
We do need to special handle for LLONG_MIN/(-1) case.
My point about Rx_a that idiv will cause out-of-range exception
for many other values than Rx_a == INT64_MIN.
I'm not sure that divisor -1 is the only such case either.
Probably is, since intuitively -2 and all other divisors should fit fine.
So the check likely needs Rx_b == -1 and a check for high bit in Rx_a ?
Looks like only Rx_a == INT64_MIN may cause the problem.
All other Rx_a numbers (from INT64_MIN+1 to INT64_MAX)
should be okay. Some selective testing below on x64 host:

$ cat t5.c
#include <stdio.h>
#include <limits.h>

unsigned long long res;
int main(void) {
    volatile long long a;
    long long i;
    for (i = LLONG_MIN + 1; i <= LLONG_MIN + 100; i++) {
      volatile long long b = -1;
      a = i;
      res += (unsigned long long)(a/b);
    }
    for (i = LLONG_MAX - 100; i <= LLONG_MAX - 1; i++) {
Changing this test to i <= LLONG_MAX
and compiling with gcc -O0 or clang -O2 or clang -O0
is causing an exception,
because 'a' becomes LLONG_MIN.

This is my theory.
If change to i <= LLONG_MAX, then after
i = LLONG_MAX, it will do 'i++' and then
it will become tricky since then undefined
behavior will pick in as 'i++' will become
out of range.

Compilers are doing some odd code gen.
I don't understand how 'i' can wrap this way.

      volatile long long b = -1;
      a = i;
      res += (unsigned long long)(a/b);
    }
    printf("res = %llx\n", res);
    return 0;
}
$ gcc -O2 t5.c && ./a.out
res = 64

So I think it should be okay if the range is from LLONG_MIN + 1
to LLONG_MAX - 1.

Now for LLONG_MAX/(-1)

$ cat t6.c
#include <stdio.h>
#include <limits.h>
int main(void) {
    volatile long long a = LLONG_MAX;
    volatile long long b = -1;
    printf("a/b = %lld\n", a/b);
    return 0;
}
$ gcc -O2 t6.c && ./a.out
a/b = -9223372036854775807

It is okay too. So I think LLONG_MIN/(-1) is the only case
we should take care of.
The test shows that that's the case, but I still can wrap
my head around that only LLONG_MIN/(-1) is a problem.

Any math experts can explain this?




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux