Hi Breno,

Thanks for chasing this one down.

Breno Leitao <leitao@xxxxxxxxxx> writes:
> On a signal handler return, the user could set a context with MSR[TS] bits
> set, and these bits would be copied to task regs->msr.
>
> At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
> several __get_user() are called and then a recheckpoint is executed.
>
> This is a problem since a page fault (in kernel space) could happen when
> calling __get_user(). If it happens, the process MSR[TS] bits were
> already set, but recheckpoint was not executed, and SPRs are still invalid.
>
> The page fault can cause the current process to be de-scheduled, with
> MSR[TS] active and without tm_recheckpoint() being called. More
> importantly, without TEXAR[FS] bit set also.
>
> Since TEXASR might not have the FS bit set, and when the process is
> scheduled back, it will try to reclaim, which will be aborted because of
> the CPU is not in the suspended state, and, then, recheckpoint. This
> recheckpoint will restore thread->texasr into TEXASR SPR, which might be
> zero, hitting a BUG_ON().
>
> [ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446!

As Mikey said, would be good to have at least the stack trace & NIP here, if
not the full oops.

> This patch simply delays the MSR[TS] set, so, if there is any page fault in
> the __get_user() section, it does not have regs->msr[TS] set, since the TM
> structures are still invalid, thus avoiding doing TM operations for
> in-kernel exceptions and possible process reschedule.
>
> With this patch, the MSR[TS] will only be set just before recheckpointing
> and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
> invalid state.

To make this safe when PREEMPT is enabled we need to preempt_disable() /
enable() around the setting of regs->msr and the recheckpoint. That could also
serve as nice documentation.

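To illustrate the ordering problem, here is a rough userspace model, not kernel
code. Every model_* name is made up for the illustration, and the "scheduler"
is just a function called at the points where a real task could be switched
out. It only shows why the TS bits must not be visible before the recheckpoint
unless preemption is held off:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_task {
	bool msr_ts;             /* stands in for the TS bits in regs->msr */
	bool texasr_fs;          /* stands in for TEXASR[FS] in thread state */
	bool recheckpointed;     /* set once the model recheckpoint has run */
	int preempt_count;       /* > 0 means we cannot be switched out */
	bool saw_invalid_switch; /* switched out with TS set, no recheckpoint */
};

/* Called wherever the real task could be scheduled away. */
static void model_maybe_schedule(struct model_task *t)
{
	if (t->preempt_count > 0)
		return;
	if (t->msr_ts && !t->recheckpointed)
		t->saw_invalid_switch = true;
}

/* Stand-in for __get_user(): it may fault and sleep. */
static void model_get_user(struct model_task *t)
{
	model_maybe_schedule(t);
}

/* Stand-in for tm_recheckpoint(); the assert mimics the BUG_ON. */
static void model_recheckpoint(struct model_task *t)
{
	assert(t->texasr_fs);
	t->recheckpointed = true;
}

/* Old ordering: TS bits set first, user copies afterwards. */
static void model_restore_old(struct model_task *t)
{
	t->msr_ts = true;
	model_get_user(t);
	t->texasr_fs = true;
	model_recheckpoint(t);
}

/* New ordering: user copies first, TS set inside a no-preempt section. */
static void model_restore_new(struct model_task *t)
{
	model_get_user(t);
	t->texasr_fs = true;
	t->preempt_count++;        /* models preempt_disable() */
	t->msr_ts = true;
	model_maybe_schedule(t);   /* a preemption point that is now blocked */
	model_recheckpoint(t);
	t->preempt_count--;        /* models preempt_enable() */
}
```

In this model only model_restore_old() can be switched out with the TS bits set
and no recheckpoint done, which is the window the patch is trying to close.
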
I guess the other question is whether it should be the job of
tm_recheckpoint() to set regs->msr, given that it already hard disables
interrupts.

eg. we'd set the TM flags in a local msr variable and pass that to
tm_recheckpoint(), it would then assign that to regs->msr in the IRQ disabled
section. Though there are many callers of tm_recheckpoint() that don't need
that behaviour, so it would probably need to be factored out.

> It is not possible to move tm_recheckpoint to happen earlier, because it is
> required to get the checkpointed registers from userspace, with
> __get_user(), thus, the only way to avoid this undesired behavior is
> delaying the MSR[TS] set, as done in this patch.

I think the root cause here is that we're copying into the live regs of
current. That has obviously worked in the past, because the register state
wasn't used until we returned back to userspace. But that's no longer true
with TM. And even so it's quite subtle. I also suspect some of our FP/VEC
handling may not work correctly if we're scheduled part way through restoring
the regs.

What might work better is if we copy all the regs into temporary space and
then with interrupts disabled we copy them into the task. That way we should
never be scheduled with a half-populated set of regs.

That's obviously a much bigger patch though and something we'll have to do
later.

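Roughly what I mean by staging, as a userspace sketch only. The struct, the
MODEL_NREGS constant and the model_* helpers are all invented for the example,
and a NULL source pointer stands in for a __get_user() that faults:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NREGS 4 /* hypothetical, far fewer than the real register set */

/* Hypothetical stand-in for the task's live register state. */
struct model_regs {
	unsigned long gpr[MODEL_NREGS];
	unsigned long msr;
};

/* Stand-in for __get_user(): returns 0 on success, -1 on a "fault". */
static int model_get_user(unsigned long *dst, const unsigned long *src)
{
	if (src == NULL)
		return -1;
	*dst = *src;
	return 0;
}

/*
 * Copy everything into a temporary first, and only touch the live regs
 * once every user access has succeeded.
 */
static int model_restore_staged(struct model_regs *live,
				const unsigned long *user_gpr[MODEL_NREGS],
				const unsigned long *user_msr)
{
	struct model_regs tmp;
	int i;

	for (i = 0; i < MODEL_NREGS; i++)
		if (model_get_user(&tmp.gpr[i], user_gpr[i]))
			return -1;
	if (model_get_user(&tmp.msr, user_msr))
		return -1;

	/* In the kernel this commit would run with interrupts disabled. */
	*live = tmp;
	return 0;
}
```

The point is just that a fault part way through leaves the live regs untouched,
so there is never a half-populated state for the scheduler to see.
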
> Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
> Cc: stable@xxxxxxxxxxxxxxx (v3.9+)
> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
> ---
>  arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index 83d51bf586c7..15b153bdd826 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>  	if (MSR_TM_RESV(msr))
>  		return -EINVAL;
>
> -	/* pull in MSR TS bits from user context */
> -	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
> -
> -	/*
> -	 * Ensure that TM is enabled in regs->msr before we leave the signal
> -	 * handler. It could be the case that (a) user disabled the TM bit
> -	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
> -	 * TM bit was disabled because a sufficient number of context switches
> -	 * happened whilst in the signal handler and load_tm overflowed,
> -	 * disabling the TM bit. In either case we can end up with an illegal
> -	 * TM state leading to a TM Bad Thing when we return to userspace.
> -	 */
> -	regs->msr |= MSR_TM;
> -
>  	/* pull in MSR LE from user context */
>  	regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
>
> @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
>  	tm_enable();
>  	/* Make sure the transaction is marked as failed */
>  	tsk->thread.tm_texasr |= TEXASR_FS;
> +	preempt_disable();
> +	/* pull in MSR TS bits from user context */
> +	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
> +
> +	/*
> +	 * Ensure that TM is enabled in regs->msr before we leave the signal
> +	 * handler. It could be the case that (a) user disabled the TM bit
> +	 * through the manipulation of the MSR bits in uc_mcontext or (b) the
> +	 * TM bit was disabled because a sufficient number of context switches
> +	 * happened whilst in the signal handler and load_tm overflowed,
> +	 * disabling the TM bit. In either case we can end up with an illegal
> +	 * TM state leading to a TM Bad Thing when we return to userspace.
> +	 */
> +	regs->msr |= MSR_TM;
> +
>  	/* This loads the checkpointed FP/VEC state, if used */
>  	tm_recheckpoint(&tsk->thread);
>  	preempt_enable();

Although looking at the code that follows, it probably won't cope with being
preempted either. So the preempt_enable() should probably go at the end of the
function.

cheers