Re: [next] arm: Internal error: Oops: 5 PC is at __read_once_word_nocheck

"Russell King (Oracle)" <linux@xxxxxxxxxxxxxxx> · Thu, 10 Mar 2022 13:01:01 +0000

On Thu, Mar 10, 2022 at 12:35:55PM +0000, Russell King (Oracle) wrote:
> On Wed, Mar 09, 2022 at 09:42:29PM +0100, Ard Biesheuvel wrote:
> > On Wed, 9 Mar 2022 at 20:39, Russell King (Oracle)
> > <linux@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 09, 2022 at 08:14:30PM +0100, Ard Biesheuvel wrote:
> > > > The backtrace dumped by __die() uses the pt_regs from the exception
> > > > context as the starting point, so the exception entry code that deals
> > > > with the condition that triggered the oops is omitted, and does not
> > > > have to be unwound.
> > >
> > > That is true, but that's not really the case I was thinking about.
> > > I was thinking more about cases such as RCU stalls, soft lockups,
> > > etc.
> > >
> > > For example:
> > >
> > > https://www.linuxquestions.org/questions/linux-kernel-70/kenel-v4-4-60-panic-in-igmp6_send-and-and-__neigh_create-4175704721/
> > >
> > > In that stack trace, the interesting bits are not the beginning of
> > > the stack trace down to __irq_svc, but everything beyond __irq_svc,
> > > since the lockup is probably caused by being stuck in
> > > _raw_write_lock_bh().
> > >
> > > It's these situations that we will totally destroy debuggability for,
> > > and the only way around that would be to force frame pointers and
> > > ARM builds (not Thumb-2 as that requires the unwinder... which means
> > > a Thumb-2 kernel soft lockup would be undebuggable.
> > >
> > 
> > Indeed.
> > 
> > But that means that the only other choice we have is to retain the
> > imprecise nature of the current solution (which usually works fine
> > btw), and simply deal with the faulting double dereference of vsp in
> > the unwinder code. We simply don't know whether the exception was
> > taken at a point where the stack frame is consistent with the unwind
> > data.
> 
> Okay, further analysis (for the record, since I've said much of this on
> IRC):
> 
> What we have currently is a robust unwinder that will cope when things
> go wrong, such as an interrupt taken during the prologue of a function.
> The way it copes is by two mechanisms:
> 
>         /* store the highest address on the stack to avoid crossing it*/
>         low = frame->sp;
>         ctrl.sp_high = ALIGN(low, THREAD_SIZE);
> 
> These two represent the allowable bounds of the kernel stack. When we
> run the unwinder, before each unwind instruction we check whether the
> current SP value is getting close to the top of the kernel stack, and
> if so, turn on additional checking:
> 
>                 if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs))
>                         ctrl.check_each_pop = 1;
> 
> that will ensure if we go off the top of the kernel stack, the
> unwinder will report failure, and not access those addresses.
> 
> After each instruction, we check whether the SP value is within the
> above bounds:
> 
>                 if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high)
>                         return -URC_FAILURE;
> 
> This means that the unwinder can never modify SP to point outside of
> the kernel stack region identified by low..ctrl.sp_high, thereby
> protecting the load inside unwind_pop_register() from ever
> dereferencing something outside of the kernel stack. Moreover, it also
> prevents the unwinder modifying SP to point below the current stack
> frame.
> 
> The problem has been introduced by trying to make the unwinder cope
> with IRQ stacks in b6506981f880 ("ARM: unwind: support unwinding across
> multiple stacks"):
> 
> -       if (!load_sp)
> +       if (!load_sp) {
>                 ctrl->vrs[SP] = (unsigned long)vsp;
> +       } else {
> +               ctrl->sp_low = ctrl->vrs[SP];
> +               ctrl->sp_high = ALIGN(ctrl->sp_low, THREAD_SIZE);
> +       }
> 
> Now, whenever SP is loaded, we reset the allowable range for the SP
> value, and this completely defeats the protections we previously had
> which were ensuring that:
> 
> 1) the SP value doesn't jump back _down_ the kernel stack resulting
>    in an infinite unwind loop.
> 2) the SP value doesn't end up outside the kernel stack.
> 
> We need those protections to prevent these problems that are being
> reported - and the most efficient way I can think of doing that is to
> somehow valudate the new SP value _before_ we modify sp_low and
> sp_high, so these two limits are always valid.
> 
> Merely changing the READ_ONCE_NOCHECK() to be get_kernel_nocheck()
> will only partly fix this problem - it will stop the unwinder oopsing
> the kernel, but definitely doesn't protect against (1) and doesn't
> protect against SP pointing at some thing that is accessible (e.g.
> a device or other kernel memory.)
> 
> We're yet again at Thursday, with the last linux-next prior to the
> merge window being created this evening, which really doesn't leave
> much time to get this sorted... and we can't say "this code should
> have been in earlier in the cycle" this time around, because these
> changes to the unwinder have been present in linux-next prior to
> 5.17-rc2. Annoyingly, it seems merging stuff earlier in the cycle
> doesn't actually solve the problem of these last minute debugging
> panics.
> 
> Any suggestions for what we should do? Can we come up with some way
> to validate the new SP value before 6pm UTC this evening?

Also, looking deeper at the last linaro job report:

https://lkft.validation.linaro.org/scheduler/job/4688599#L684

the dumped memory doesn't look like an exception stack. If it was,
e82aab40 would be the saved CPSR value and c388eb80 would be the PC
value, both of which are nonsense.

The stack that we were in (and we dumped out in full) was:

Stack: (0xc381bb30 to 0xc381c000)

and the exception stack (the saved pt_regs) is:

                                    r0       r1       r2       r3       r4
bfa0: c2ba47c0 0000000a c2ba1358 ffff9537 c2c05d00 00400140 c62d5624 c1948b20
         r5       r6       r7       r8       r9       r10      fp       ip
bfc0: e82ab498 c62d5400 c35377a0 c62d5404 25706000 c381bfe8 c62d5400 00000001
         sp       lr       pc      cpsr    orig_r0
bfe0: c5fcfc48 c036251c c0995a14 20040013 ffffffff c5fcfc7c c62d5400 c0300bf0

but, we end up dumping out:

Exception stack(0xc5fcfc98 to 0xc5fcfce0)
fc80:                                                       b6a3ab70 00000004
fca0: 00000000 00000004 b6a3ab70 c055f928 c388eb80 c5fcfd40 00000000 c5fcfd50
fcc0: 00000005 00000051 e82aad1c c03ae570 00000000 c388eb80 c3512a20 e82aab40

Firstly, that's in the wrong stack to be dumping for the exception
stack, and secondly, why is it 0x50 bytes above the saved SP value from
the real exception stack - that makes no sense in itself.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!