Re: Kernel oops caused by signed divide

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 10 Sep 2024 11:25:09 -0700

On Tue, Sep 10, 2024 at 11:02 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 9/10/24 8:21 AM, Alexei Starovoitov wrote:
> > On Tue, Sep 10, 2024 at 7:21 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>
> >> On 9/9/24 10:29 AM, Alexei Starovoitov wrote:
> >>> On Mon, Sep 9, 2024 at 10:21 AM Zac Ecob <zacecob@xxxxxxxxxxxxxx> wrote:
> >>>> Hello,
> >>>>
> >>>> I recently received a kernel 'oops' about a divide error.
> >>>> After some research, it seems that the 'div64_s64' function used for the 'MOD'/'REM' instructions boils down to an 'idiv'.
> >>>>
> >>>> The 'dividend' is set to INT64_MIN, and the 'divisor' to -1, then because of two's complement, there is no corresponding positive value, causing the error (at least to my understanding).
> >>>>
> >>>>
> >>>> Apologies if this is already known / not a relevant concern.
> >>> Thanks for the report. This is a new issue.
> >>>
> >>> Yonghong,
> >>>
> >>> it's related to the new signed div insn.
> >>> It sounds like we need to update chk_and_div[] part of
> >>> the verifier to account for signed div differently.
> >> In verifier, we have
> >>     /* [R,W]x div 0 -> 0 */
> >>     /* [R,W]x mod 0 -> [R,W]x */
> > the verifier is doing what hw does. In this case this is arm64 behavior.
>
> Okay, I see. I tried it on an arm64 machine and it indeed behaves as described above.
>
> # uname -a
> Linux ... #1 SMP PREEMPT_DYNAMIC Thu Aug  1 06:58:32 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux
> # cat t2.c
> #include <stdio.h>
> #include <limits.h>
> int main(void) {
>    volatile long long a = 5;
>    volatile long long b = 0;
>    printf("a/b = %lld\n", a/b);
>    return 0;
> }
> # cat t3.c
> #include <stdio.h>
> #include <limits.h>
> int main(void) {
>    volatile long long a = 5;
>    volatile long long b = 0;
>    printf("a%%b = %lld\n", a%b);
>    return 0;
> }
> # gcc -O2 t2.c && ./a.out
> a/b = 0
> # gcc -O2 t3.c && ./a.out
> a%b = 5
>
> On arm64, a clang18-compiled binary gives the same result:
>
> # clang -O2 t2.c && ./a.out
> a/b = 0
> # clang -O2 t3.c && ./a.out
> a%b = 5
>
> The same source code, also compiled with -O2 but on x86_64,
> produces:
>    Floating point exception (core dumped)
>
> >
> >> What is the value for
> >>     Rx_a sdiv Rx_b -> ?
> >> where Rx_a = INT64_MIN and Rx_b = -1?
> > Why does it matter what Rx_a contains ?
>
> It does matter. See below:
>
> on arm64:
>
> # cat t1.c
> #include <stdio.h>
> #include <limits.h>
> int main(void) {
>    volatile long long a = LLONG_MIN;
>    volatile long long b = -1;
>    printf("a/b = %lld\n", a/b);
>    return 0;
> }
> # clang -O2 t1.c && ./a.out
> a/b = -9223372036854775808
> # gcc -O2 t1.c && ./a.out
> a/b = -9223372036854775808
>
> So the result of a/b is LLONG_MIN
>
> The same code causes an exception on x86_64:
>
> $ uname -a
> Linux ... #1 SMP Wed Jun  5 06:21:21 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
> [yhs@devvm1513.prn0 ~]$ gcc -O2 t1.c && ./a.out
> Floating point exception (core dumped)
> [yhs@devvm1513.prn0 ~]$ clang -O2 t1.c && ./a.out
> Floating point exception (core dumped)
>
> So this is what we care about.
>
> So I guess we can follow arm64 result too.
>
> >
> > What do CPUs do in this case?
>
> See above. arm64 produces *some* result while x86_64 raises an exception.
> We do need to special-case LLONG_MIN/(-1).

My point about Rx_a is that idiv will raise an out-of-range exception
for many other values besides Rx_a == INT64_MIN.
I'm not sure that divisor -1 is the only such case either.
It probably is, since intuitively -2 and all other divisors should fit fine.
So the check likely needs Rx_b == -1 plus a check for the high bit in Rx_a?
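
For illustration, the special-casing discussed in this thread can be sketched
in plain C. This is a rough sketch under stated assumptions, not the verifier
patch: safe_sdiv64() and safe_smod64() are hypothetical names, the
divide-by-zero results follow the "[R,W]x div 0 -> 0" and
"[R,W]x mod 0 -> [R,W]x" comments quoted above, and the INT64_MIN / -1 result
follows the arm64 output of t1.c. Treating every divisor of -1 as a negation
(with remainder 0) is one way to avoid having to decide which Rx_a values are
problematic.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): signed 64-bit divide
 * that never reaches the hardware divide with a trapping operand pair.
 */
static int64_t safe_sdiv64(int64_t a, int64_t b)
{
	if (b == 0)
		return 0;	/* assumed convention: x div 0 -> 0 */
	if (b == -1)
		/*
		 * a / -1 is -a. Negate in unsigned arithmetic so that
		 * INT64_MIN wraps back to INT64_MIN instead of overflowing.
		 * The conversion back to int64_t relies on the two's
		 * complement wrapping that gcc and clang provide.
		 */
		return (int64_t)(0 - (uint64_t)a);
	return a / b;
}

/* Hypothetical helper: signed 64-bit remainder with the same guards. */
static int64_t safe_smod64(int64_t a, int64_t b)
{
	if (b == 0)
		return a;	/* assumed convention: x mod 0 -> x */
	if (b == -1)
		return 0;	/* every integer is divisible by -1 */
	return a % b;
}
```

With these helpers, safe_sdiv64(INT64_MIN, -1) returns INT64_MIN, matching
the arm64 run above, and safe_smod64(5, 0) returns 5, matching t3.c.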