On Thu, 10 Mar 2022 at 14:01, Russell King (Oracle) <linux@xxxxxxxxxxxxxxx> wrote: > > On Thu, Mar 10, 2022 at 12:35:55PM +0000, Russell King (Oracle) wrote: > > On Wed, Mar 09, 2022 at 09:42:29PM +0100, Ard Biesheuvel wrote: > > > On Wed, 9 Mar 2022 at 20:39, Russell King (Oracle) > > > <linux@xxxxxxxxxxxxxxx> wrote: > > > > > > > > On Wed, Mar 09, 2022 at 08:14:30PM +0100, Ard Biesheuvel wrote: > > > > > The backtrace dumped by __die() uses the pt_regs from the exception > > > > > context as the starting point, so the exception entry code that deals > > > > > with the condition that triggered the oops is omitted, and does not > > > > > have to be unwound. > > > > > > > > That is true, but that's not really the case I was thinking about. > > > > I was thinking more about cases such as RCU stalls, soft lockups, > > > > etc. > > > > > > > > For example: > > > > > > > > https://www.linuxquestions.org/questions/linux-kernel-70/kenel-v4-4-60-panic-in-igmp6_send-and-and-__neigh_create-4175704721/ > > > > > > > > In that stack trace, the interesting bits are not the beginning of > > > > the stack trace down to __irq_svc, but everything beyond __irq_svc, > > > > since the lockup is probably caused by being stuck in > > > > _raw_write_lock_bh(). > > > > > > > > It's these situations that we will totally destroy debuggability for, > > > > and the only way around that would be to force frame pointers and > > > > ARM builds (not Thumb-2 as that requires the unwinder... which means > > > > a Thumb-2 kernel soft lockup would be undebuggable. > > > > > > > > > > Indeed. > > > > > > But that means that the only other choice we have is to retain the > > > imprecise nature of the current solution (which usually works fine > > > btw), and simply deal with the faulting double dereference of vsp in > > > the unwinder code. We simply don't know whether the exception was > > > taken at a point where the stack frame is consistent with the unwind > > > data. > > > > Okay, further analysis (for the record, since I've said much of this on > > IRC): > > > > What we have currently is a robust unwinder that will cope when things > > go wrong, such as an interrupt taken during the prologue of a function. > > The way it copes is by two mechanisms: > > > > /* store the highest address on the stack to avoid crossing it*/ > > low = frame->sp; > > ctrl.sp_high = ALIGN(low, THREAD_SIZE); > > > > These two represent the allowable bounds of the kernel stack. When we > > run the unwinder, before each unwind instruction we check whether the > > current SP value is getting close to the top of the kernel stack, and > > if so, turn on additional checking: > > > > if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs)) > > ctrl.check_each_pop = 1; > > > > that will ensure if we go off the top of the kernel stack, the > > unwinder will report failure, and not access those addresses. > > > > After each instruction, we check whether the SP value is within the > > above bounds: > > > > if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high) > > return -URC_FAILURE; > > > > This means that the unwinder can never modify SP to point outside of > > the kernel stack region identified by low..ctrl.sp_high, thereby > > protecting the load inside unwind_pop_register() from ever > > dereferencing something outside of the kernel stack. Moreover, it also > > prevents the unwinder modifying SP to point below the current stack > > frame. > > > > The problem has been introduced by trying to make the unwinder cope > > with IRQ stacks in b6506981f880 ("ARM: unwind: support unwinding across > > multiple stacks"): > > > > - if (!load_sp) > > + if (!load_sp) { > > ctrl->vrs[SP] = (unsigned long)vsp; > > + } else { > > + ctrl->sp_low = ctrl->vrs[SP]; > > + ctrl->sp_high = ALIGN(ctrl->sp_low, THREAD_SIZE); > > + } > > > > Now, whenever SP is loaded, we reset the allowable range for the SP > > value, and this completely defeats the protections we previously had > > which were ensuring that: > > > > 1) the SP value doesn't jump back _down_ the kernel stack resulting > > in an infinite unwind loop. > > 2) the SP value doesn't end up outside the kernel stack. > > > > We need those protections to prevent these problems that are being > > reported - and the most efficient way I can think of doing that is to > > somehow valudate the new SP value _before_ we modify sp_low and > > sp_high, so these two limits are always valid. > > > > Merely changing the READ_ONCE_NOCHECK() to be get_kernel_nocheck() > > will only partly fix this problem - it will stop the unwinder oopsing > > the kernel, but definitely doesn't protect against (1) and doesn't > > protect against SP pointing at some thing that is accessible (e.g. > > a device or other kernel memory.) > > > > We're yet again at Thursday, with the last linux-next prior to the > > merge window being created this evening, which really doesn't leave > > much time to get this sorted... and we can't say "this code should > > have been in earlier in the cycle" this time around, because these > > changes to the unwinder have been present in linux-next prior to > > 5.17-rc2. Annoyingly, it seems merging stuff earlier in the cycle > > doesn't actually solve the problem of these last minute debugging > > panics. > > > > Any suggestions for what we should do? Can we come up with some way > > to validate the new SP value before 6pm UTC this evening? > > Also, looking deeper at the last linaro job report: > > https://lkft.validation.linaro.org/scheduler/job/4688599#L684 > > the dumped memory doesn't look like an exception stack. If it was, > e82aab40 would be the saved CPSR value and c388eb80 would be the PC > value, both of which are nonsense. > > The stack that we were in (and we dumped out in full) was: > > Stack: (0xc381bb30 to 0xc381c000) > This is the IRQ stack that we switch to in [ 29.198048] irq_exit from __irq_svc+0x70/0x8c [ 29.200896] Exception stack(0xc5fcfc98 to 0xc5fcfce0) > and the exception stack (the saved pt_regs) is: > > r0 r1 r2 r3 r4 > bfa0: c2ba47c0 0000000a c2ba1358 ffff9537 c2c05d00 00400140 c62d5624 c1948b20 > r5 r6 r7 r8 r9 r10 fp ip > bfc0: e82ab498 c62d5400 c35377a0 c62d5404 25706000 c381bfe8 c62d5400 00000001 > sp lr pc cpsr orig_r0 > bfe0: c5fcfc48 c036251c c0995a14 20040013 ffffffff c5fcfc7c c62d5400 c0300bf0 > I don't follow this - which pt_regs is this? > but, we end up dumping out: > > Exception stack(0xc5fcfc98 to 0xc5fcfce0) > fc80: b6a3ab70 00000004 > fca0: 00000000 00000004 b6a3ab70 c055f928 c388eb80 c5fcfd40 00000000 c5fcfd50 > fcc0: 00000005 00000051 e82aad1c c03ae570 00000000 c388eb80 c3512a20 e82aab40 > > Firstly, that's in the wrong stack to be dumping for the exception > stack, and secondly, why is it 0x50 bytes above the saved SP value from > the real exception stack - that makes no sense in itself. > This is the exception stack that we were on when taking an IRQ and subsequently switching to the IRQ stack.