Re: [next] arm: Internal error: Oops: 5 PC is at __read_once_word_nocheck

Ard Biesheuvel <ardb@xxxxxxxxxx> · Thu, 10 Mar 2022 14:14:56 +0100

On Thu, 10 Mar 2022 at 14:01, Russell King (Oracle)
<linux@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, Mar 10, 2022 at 12:35:55PM +0000, Russell King (Oracle) wrote:
> > On Wed, Mar 09, 2022 at 09:42:29PM +0100, Ard Biesheuvel wrote:
> > > On Wed, 9 Mar 2022 at 20:39, Russell King (Oracle)
> > > <linux@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Mar 09, 2022 at 08:14:30PM +0100, Ard Biesheuvel wrote:
> > > > > The backtrace dumped by __die() uses the pt_regs from the exception
> > > > > context as the starting point, so the exception entry code that deals
> > > > > with the condition that triggered the oops is omitted, and does not
> > > > > have to be unwound.
> > > >
> > > > That is true, but that's not really the case I was thinking about.
> > > > I was thinking more about cases such as RCU stalls, soft lockups,
> > > > etc.
> > > >
> > > > For example:
> > > >
> > > > https://www.linuxquestions.org/questions/linux-kernel-70/kenel-v4-4-60-panic-in-igmp6_send-and-and-__neigh_create-4175704721/
> > > >
> > > > In that stack trace, the interesting bits are not the beginning of
> > > > the stack trace down to __irq_svc, but everything beyond __irq_svc,
> > > > since the lockup is probably caused by being stuck in
> > > > _raw_write_lock_bh().
> > > >
> > > > It's these situations that we will totally destroy debuggability for,
> > > > and the only way around that would be to force frame pointers and
> > > > ARM builds (not Thumb-2 as that requires the unwinder... which means
> > > > a Thumb-2 kernel soft lockup would be undebuggable.
> > > >
> > >
> > > Indeed.
> > >
> > > But that means that the only other choice we have is to retain the
> > > imprecise nature of the current solution (which usually works fine
> > > btw), and simply deal with the faulting double dereference of vsp in
> > > the unwinder code. We simply don't know whether the exception was
> > > taken at a point where the stack frame is consistent with the unwind
> > > data.
> >
> > Okay, further analysis (for the record, since I've said much of this on
> > IRC):
> >
> > What we have currently is a robust unwinder that will cope when things
> > go wrong, such as an interrupt taken during the prologue of a function.
> > The way it copes is by two mechanisms:
> >
> >         /* store the highest address on the stack to avoid crossing it*/
> >         low = frame->sp;
> >         ctrl.sp_high = ALIGN(low, THREAD_SIZE);
> >
> > These two represent the allowable bounds of the kernel stack. When we
> > run the unwinder, before each unwind instruction we check whether the
> > current SP value is getting close to the top of the kernel stack, and
> > if so, turn on additional checking:
> >
> >                 if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs))
> >                         ctrl.check_each_pop = 1;
> >
> > that will ensure if we go off the top of the kernel stack, the
> > unwinder will report failure, and not access those addresses.
> >
> > After each instruction, we check whether the SP value is within the
> > above bounds:
> >
> >                 if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high)
> >                         return -URC_FAILURE;
> >
> > This means that the unwinder can never modify SP to point outside of
> > the kernel stack region identified by low..ctrl.sp_high, thereby
> > protecting the load inside unwind_pop_register() from ever
> > dereferencing something outside of the kernel stack. Moreover, it also
> > prevents the unwinder modifying SP to point below the current stack
> > frame.
> >
> > The problem has been introduced by trying to make the unwinder cope
> > with IRQ stacks in b6506981f880 ("ARM: unwind: support unwinding across
> > multiple stacks"):
> >
> > -       if (!load_sp)
> > +       if (!load_sp) {
> >                 ctrl->vrs[SP] = (unsigned long)vsp;
> > +       } else {
> > +               ctrl->sp_low = ctrl->vrs[SP];
> > +               ctrl->sp_high = ALIGN(ctrl->sp_low, THREAD_SIZE);
> > +       }
> >
> > Now, whenever SP is loaded, we reset the allowable range for the SP
> > value, and this completely defeats the protections we previously had
> > which were ensuring that:
> >
> > 1) the SP value doesn't jump back _down_ the kernel stack resulting
> >    in an infinite unwind loop.
> > 2) the SP value doesn't end up outside the kernel stack.
> >
> > We need those protections to prevent these problems that are being
> > reported - and the most efficient way I can think of doing that is to
> > somehow valudate the new SP value _before_ we modify sp_low and
> > sp_high, so these two limits are always valid.
> >
> > Merely changing the READ_ONCE_NOCHECK() to be get_kernel_nocheck()
> > will only partly fix this problem - it will stop the unwinder oopsing
> > the kernel, but definitely doesn't protect against (1) and doesn't
> > protect against SP pointing at some thing that is accessible (e.g.
> > a device or other kernel memory.)
> >
> > We're yet again at Thursday, with the last linux-next prior to the
> > merge window being created this evening, which really doesn't leave
> > much time to get this sorted... and we can't say "this code should
> > have been in earlier in the cycle" this time around, because these
> > changes to the unwinder have been present in linux-next prior to
> > 5.17-rc2. Annoyingly, it seems merging stuff earlier in the cycle
> > doesn't actually solve the problem of these last minute debugging
> > panics.
> >
> > Any suggestions for what we should do? Can we come up with some way
> > to validate the new SP value before 6pm UTC this evening?
>
> Also, looking deeper at the last linaro job report:
>
> https://lkft.validation.linaro.org/scheduler/job/4688599#L684
>
> the dumped memory doesn't look like an exception stack. If it was,
> e82aab40 would be the saved CPSR value and c388eb80 would be the PC
> value, both of which are nonsense.
>
> The stack that we were in (and we dumped out in full) was:
>
> Stack: (0xc381bb30 to 0xc381c000)
>

This is the IRQ stack that we switch to in

[ 29.198048] irq_exit from __irq_svc+0x70/0x8c
[ 29.200896] Exception stack(0xc5fcfc98 to 0xc5fcfce0)

> and the exception stack (the saved pt_regs) is:
>
>                                     r0       r1       r2       r3       r4
> bfa0: c2ba47c0 0000000a c2ba1358 ffff9537 c2c05d00 00400140 c62d5624 c1948b20
>          r5       r6       r7       r8       r9       r10      fp       ip
> bfc0: e82ab498 c62d5400 c35377a0 c62d5404 25706000 c381bfe8 c62d5400 00000001
>          sp       lr       pc      cpsr    orig_r0
> bfe0: c5fcfc48 c036251c c0995a14 20040013 ffffffff c5fcfc7c c62d5400 c0300bf0
>

I don't follow this - which pt_regs is this?

> but, we end up dumping out:
>
> Exception stack(0xc5fcfc98 to 0xc5fcfce0)
> fc80:                                                       b6a3ab70 00000004
> fca0: 00000000 00000004 b6a3ab70 c055f928 c388eb80 c5fcfd40 00000000 c5fcfd50
> fcc0: 00000005 00000051 e82aad1c c03ae570 00000000 c388eb80 c3512a20 e82aab40
>
> Firstly, that's in the wrong stack to be dumping for the exception
> stack, and secondly, why is it 0x50 bytes above the saved SP value from
> the real exception stack - that makes no sense in itself.
>

This is the exception stack that we were on when taking an IRQ and
subsequently switching to the IRQ stack.