hi Michael, On 11/20/18 8:34 AM, Michael Ellerman wrote: > Hi Breno, > > Thanks for chasing this one down. > > Breno Leitao <leitao@xxxxxxxxxx> writes: > >> On a signal handler return, the user could set a context with MSR[TS] bits >> set, and these bits would be copied to task regs->msr. >> >> At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set, >> several __get_user() are called and then a recheckpoint is executed. >> >> This is a problem since a page fault (in kernel space) could happen when >> calling __get_user(). If it happens, the process MSR[TS] bits were >> already set, but recheckpoint was not executed, and SPRs are still invalid. >> >> The page fault can cause the current process to be de-scheduled, with >> MSR[TS] active and without tm_recheckpoint() being called. More >> importantly, without TEXAR[FS] bit set also. >> >> Since TEXASR might not have the FS bit set, and when the process is >> scheduled back, it will try to reclaim, which will be aborted because of >> the CPU is not in the suspended state, and, then, recheckpoint. This >> recheckpoint will restore thread->texasr into TEXASR SPR, which might be >> zero, hitting a BUG_ON(). >> >> [ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446! > > As Mikey said, would be good to have at least the stack trace & NIP > here, if not the full oops. Ack! >> This patch simply delays the MSR[TS] set, so, if there is any page fault in >> the __get_user() section, it does not have regs->msr[TS] set, since the TM >> structures are still invalid, thus avoiding doing TM operations for >> in-kernel exceptions and possible process reschedule. >> >> With this patch, the MSR[TS] will only be set just before recheckpointing >> and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in >> invalid state. > > To make this safe when PREEMPT is enabled we need to preempt_disable() / > enable() around the setting of regs->msr and the recheckpoint. > > That could also serve as nice documentation. > > I guess the other question is whether it should be the job of > tm_recheckpoint() to set regs->msr, given that it already hard disables > interrupts. > > eg. we'd set the TM flags in a local msr variable and pass the to > tm_recheckpoint(), it would then assign that to regs->msr in the IRQ > disabled section. Though there's many callers of tm_recheckpoint() that > don't need that behaviour, so it would probably need to be factored out. Right, that might be doable, but I am wondering if it isn't better to create a new function that does it as below: void tm_set_msr_and_recheckpoint(struct pt_regs *regs, u64 msr) { preempt_disable(); regs->msr = msr; tm_recheckpoint(); preempt_enable(); } I understand that preempt_disable() does a better work disabling preemption compared to disabling IRQ. Also, it does not change the API just for this very specific signal case. >> It is not possible to move tm_recheckpoint to happen earlier, because it is >> required to get the checkpointed registers from userspace, with >> __get_user(), thus, the only way to avoid this undesired behavior is >> delaying the MSR[TS] set, as done in this patch. > > I think the root cause here is that we're copying into the live regs of > current. That has obviously worked in the past, because the register > state wasn't used until we returned back to userspace. But that's no > longer true with TM. And even so it's quite subtle. I also suspect some > of our FP/VEC handling may not work correctly if we're scheduled part > way through restoring the regs. I got your point and I think we have a risk of corrupting the regs if MSR[TS] is set and we do page_fault->recheckpoint/recheclaim in the middle of __get_user() chunks. > What might work better is if we copy all the regs into temporary space > and then with interrupts disabled we copy them into the task. That way > we should never be scheduled with a half-populated set of regs. Yes, that seems to be the best strategy, and I might be interested in working on it. > That's obviously a much bigger patch though and something we'll have to > do later. > >> Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals") >> Cc: stable@xxxxxxxxxxxxxxx (v3.9+) >> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx> >> --- >> arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++-------------- >> 1 file changed, 15 insertions(+), 14 deletions(-) >> >> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c >> index 83d51bf586c7..15b153bdd826 100644 >> --- a/arch/powerpc/kernel/signal_64.c >> +++ b/arch/powerpc/kernel/signal_64.c >> @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, >> if (MSR_TM_RESV(msr)) >> return -EINVAL; >> >> - /* pull in MSR TS bits from user context */ >> - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); >> - >> - /* >> - * Ensure that TM is enabled in regs->msr before we leave the signal >> - * handler. It could be the case that (a) user disabled the TM bit >> - * through the manipulation of the MSR bits in uc_mcontext or (b) the >> - * TM bit was disabled because a sufficient number of context switches >> - * happened whilst in the signal handler and load_tm overflowed, >> - * disabling the TM bit. In either case we can end up with an illegal >> - * TM state leading to a TM Bad Thing when we return to userspace. >> - */ >> - regs->msr |= MSR_TM; >> - >> /* pull in MSR LE from user context */ >> regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE); >> >> @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, >> tm_enable(); >> /* Make sure the transaction is marked as failed */ >> tsk->thread.tm_texasr |= TEXASR_FS; >> + > > preempt_disable(); > >> + /* pull in MSR TS bits from user context */ >> + regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); >> + >> + /* >> + * Ensure that TM is enabled in regs->msr before we leave the signal >> + * handler. It could be the case that (a) user disabled the TM bit >> + * through the manipulation of the MSR bits in uc_mcontext or (b) the >> + * TM bit was disabled because a sufficient number of context switches >> + * happened whilst in the signal handler and load_tm overflowed, >> + * disabling the TM bit. In either case we can end up with an illegal >> + * TM state leading to a TM Bad Thing when we return to userspace. >> + */ >> + regs->msr |= MSR_TM; >> + >> /* This loads the checkpointed FP/VEC state, if used */ >> tm_recheckpoint(&tsk->thread); >> > preempt_enable(); > > Although looking at the code that follows, it probably won't cope with > being preempted either. So the preempt_enable() should probably go at > the end of the function. Why? I confess I do not understand the preempt mechanism a lot. Does a preemption save/restore FP and vector states when it is preempted? What is the code/IRQ that is executed when there is a preemption? I am wondering if a preempt happens because the Decrementer (0x900) IRQ happens and a syscall is being executed, thus, saving the whole context and then restoring it. If my guess is correct, the IRQ might not save the FP/VEC states, thus, your concern about ahving load_fp_state() after a preemption. Let me paste the "patched" code here to help with the discussion: regs->msr |= MSR_TM; /* This loads the checkpointed FP/VEC state, if used */ tm_recheckpoint(&tsk->thread); msr_check_and_set(msr & (MSR_FP | MSR_VEC)); if (msr & MSR_FP) { load_fp_state(&tsk->thread.fp_state); regs->msr |= (MSR_FP | tsk->thread.fpexc_mode); } if (msr & MSR_VEC) { load_vr_state(&tsk->thread.vr_state); regs->msr |= MSR_VEC; } Thanks Breno