On Wed, Mar 09, 2022 at 09:42:29PM +0100, Ard Biesheuvel wrote: > On Wed, 9 Mar 2022 at 20:39, Russell King (Oracle) > <linux@xxxxxxxxxxxxxxx> wrote: > > > > On Wed, Mar 09, 2022 at 08:14:30PM +0100, Ard Biesheuvel wrote: > > > The backtrace dumped by __die() uses the pt_regs from the exception > > > context as the starting point, so the exception entry code that deals > > > with the condition that triggered the oops is omitted, and does not > > > have to be unwound. > > > > That is true, but that's not really the case I was thinking about. > > I was thinking more about cases such as RCU stalls, soft lockups, > > etc. > > > > For example: > > > > https://www.linuxquestions.org/questions/linux-kernel-70/kenel-v4-4-60-panic-in-igmp6_send-and-and-__neigh_create-4175704721/ > > > > In that stack trace, the interesting bits are not the beginning of > > the stack trace down to __irq_svc, but everything beyond __irq_svc, > > since the lockup is probably caused by being stuck in > > _raw_write_lock_bh(). > > > > It's these situations that we will totally destroy debuggability for, > > and the only way around that would be to force frame pointers and > > ARM builds (not Thumb-2 as that requires the unwinder... which means > > a Thumb-2 kernel soft lockup would be undebuggable. > > > > Indeed. > > But that means that the only other choice we have is to retain the > imprecise nature of the current solution (which usually works fine > btw), and simply deal with the faulting double dereference of vsp in > the unwinder code. We simply don't know whether the exception was > taken at a point where the stack frame is consistent with the unwind > data. Okay, further analysis (for the record, since I've said much of this on IRC): What we have currently is a robust unwinder that will cope when things go wrong, such as an interrupt taken during the prologue of a function. The way it copes is by two mechanisms: /* store the highest address on the stack to avoid crossing it*/ low = frame->sp; ctrl.sp_high = ALIGN(low, THREAD_SIZE); These two represent the allowable bounds of the kernel stack. When we run the unwinder, before each unwind instruction we check whether the current SP value is getting close to the top of the kernel stack, and if so, turn on additional checking: if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs)) ctrl.check_each_pop = 1; that will ensure if we go off the top of the kernel stack, the unwinder will report failure, and not access those addresses. After each instruction, we check whether the SP value is within the above bounds: if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high) return -URC_FAILURE; This means that the unwinder can never modify SP to point outside of the kernel stack region identified by low..ctrl.sp_high, thereby protecting the load inside unwind_pop_register() from ever dereferencing something outside of the kernel stack. Moreover, it also prevents the unwinder modifying SP to point below the current stack frame. The problem has been introduced by trying to make the unwinder cope with IRQ stacks in b6506981f880 ("ARM: unwind: support unwinding across multiple stacks"): - if (!load_sp) + if (!load_sp) { ctrl->vrs[SP] = (unsigned long)vsp; + } else { + ctrl->sp_low = ctrl->vrs[SP]; + ctrl->sp_high = ALIGN(ctrl->sp_low, THREAD_SIZE); + } Now, whenever SP is loaded, we reset the allowable range for the SP value, and this completely defeats the protections we previously had which were ensuring that: 1) the SP value doesn't jump back _down_ the kernel stack resulting in an infinite unwind loop. 2) the SP value doesn't end up outside the kernel stack. We need those protections to prevent these problems that are being reported - and the most efficient way I can think of doing that is to somehow valudate the new SP value _before_ we modify sp_low and sp_high, so these two limits are always valid. Merely changing the READ_ONCE_NOCHECK() to be get_kernel_nocheck() will only partly fix this problem - it will stop the unwinder oopsing the kernel, but definitely doesn't protect against (1) and doesn't protect against SP pointing at some thing that is accessible (e.g. a device or other kernel memory.) We're yet again at Thursday, with the last linux-next prior to the merge window being created this evening, which really doesn't leave much time to get this sorted... and we can't say "this code should have been in earlier in the cycle" this time around, because these changes to the unwinder have been present in linux-next prior to 5.17-rc2. Annoyingly, it seems merging stuff earlier in the cycle doesn't actually solve the problem of these last minute debugging panics. Any suggestions for what we should do? Can we come up with some way to validate the new SP value before 6pm UTC this evening? -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!