Re: [next] arm: Internal error: Oops: 5 PC is at __read_once_word_nocheck

"Russell King (Oracle)" <linux@xxxxxxxxxxxxxxx> · Thu, 10 Mar 2022 12:35:55 +0000

On Wed, Mar 09, 2022 at 09:42:29PM +0100, Ard Biesheuvel wrote:
> On Wed, 9 Mar 2022 at 20:39, Russell King (Oracle)
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Mar 09, 2022 at 08:14:30PM +0100, Ard Biesheuvel wrote:
> > > The backtrace dumped by __die() uses the pt_regs from the exception
> > > context as the starting point, so the exception entry code that deals
> > > with the condition that triggered the oops is omitted, and does not
> > > have to be unwound.
> >
> > That is true, but that's not really the case I was thinking about.
> > I was thinking more about cases such as RCU stalls, soft lockups,
> > etc.
> >
> > For example:
> >
> > https://www.linuxquestions.org/questions/linux-kernel-70/kenel-v4-4-60-panic-in-igmp6_send-and-and-__neigh_create-4175704721/
> >
> > In that stack trace, the interesting bits are not the beginning of
> > the stack trace down to __irq_svc, but everything beyond __irq_svc,
> > since the lockup is probably caused by being stuck in
> > _raw_write_lock_bh().
> >
> > It's these situations that we will totally destroy debuggability for,
> > and the only way around that would be to force frame pointers and
> > ARM builds (not Thumb-2 as that requires the unwinder... which means
> > a Thumb-2 kernel soft lockup would be undebuggable.
> >
> 
> Indeed.
> 
> But that means that the only other choice we have is to retain the
> imprecise nature of the current solution (which usually works fine
> btw), and simply deal with the faulting double dereference of vsp in
> the unwinder code. We simply don't know whether the exception was
> taken at a point where the stack frame is consistent with the unwind
> data.

Okay, further analysis (for the record, since I've said much of this on
IRC):

What we have currently is a robust unwinder that will cope when things
go wrong, such as an interrupt taken during the prologue of a function.
The way it copes is by two mechanisms:

        /* store the highest address on the stack to avoid crossing it*/
        low = frame->sp;
        ctrl.sp_high = ALIGN(low, THREAD_SIZE);

These two represent the allowable bounds of the kernel stack. When we
run the unwinder, before each unwind instruction we check whether the
current SP value is getting close to the top of the kernel stack, and
if so, turn on additional checking:

                if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs))
                        ctrl.check_each_pop = 1;

that will ensure if we go off the top of the kernel stack, the
unwinder will report failure, and not access those addresses.

After each instruction, we check whether the SP value is within the
above bounds:

                if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high)
                        return -URC_FAILURE;

This means that the unwinder can never modify SP to point outside of
the kernel stack region identified by low..ctrl.sp_high, thereby
protecting the load inside unwind_pop_register() from ever
dereferencing something outside of the kernel stack. Moreover, it also
prevents the unwinder modifying SP to point below the current stack
frame.

The problem has been introduced by trying to make the unwinder cope
with IRQ stacks in b6506981f880 ("ARM: unwind: support unwinding across
multiple stacks"):

-       if (!load_sp)
+       if (!load_sp) {
                ctrl->vrs[SP] = (unsigned long)vsp;
+       } else {
+               ctrl->sp_low = ctrl->vrs[SP];
+               ctrl->sp_high = ALIGN(ctrl->sp_low, THREAD_SIZE);
+       }

Now, whenever SP is loaded, we reset the allowable range for the SP
value, and this completely defeats the protections we previously had
which were ensuring that:

1) the SP value doesn't jump back _down_ the kernel stack resulting
   in an infinite unwind loop.
2) the SP value doesn't end up outside the kernel stack.

We need those protections to prevent these problems that are being
reported - and the most efficient way I can think of doing that is to
somehow valudate the new SP value _before_ we modify sp_low and
sp_high, so these two limits are always valid.

Merely changing the READ_ONCE_NOCHECK() to be get_kernel_nocheck()
will only partly fix this problem - it will stop the unwinder oopsing
the kernel, but definitely doesn't protect against (1) and doesn't
protect against SP pointing at some thing that is accessible (e.g.
a device or other kernel memory.)

We're yet again at Thursday, with the last linux-next prior to the
merge window being created this evening, which really doesn't leave
much time to get this sorted... and we can't say "this code should
have been in earlier in the cycle" this time around, because these
changes to the unwinder have been present in linux-next prior to
5.17-rc2. Annoyingly, it seems merging stuff earlier in the cycle
doesn't actually solve the problem of these last minute debugging
panics.

Any suggestions for what we should do? Can we come up with some way
to validate the new SP value before 6pm UTC this evening?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!