----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote: > Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to > transferring to a KVM guest, which is roughly equivalent to an exit to > userspace and processes many of the same pending actions. While the task > cannot be in an rseq critical section as the KVM path is reachable only > via ioctl(KVM_RUN), the side effects that apply to rseq outside of a > critical section still apply, e.g. the CPU ID needs to be updated if the > task is migrated. > > Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults > and other badness in userspace VMMs that use rseq in combination with KVM, > e.g. due to the CPU ID being stale after task migration. I agree with the problem assessment, but I would recommend a small change to this fix. > > Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") > Reported-by: Peter Foley <pefoley@xxxxxxxxxx> > Bisected-by: Doug Evans <dje@xxxxxxxxxx> > Cc: Shakeel Butt <shakeelb@xxxxxxxxxx> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > --- > kernel/entry/kvm.c | 4 +++- > kernel/rseq.c | 4 ++-- > 2 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c > index 49972ee99aff..049fd06b4c3d 100644 > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, > unsigned long ti_work) > if (ti_work & _TIF_NEED_RESCHED) > schedule(); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(NULL); > + rseq_handle_notify_resume(NULL, NULL); > + } > > ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); > if (ret) > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..58c79a7918cd 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs > *rseq_cs) > > static int rseq_ip_fixup(struct pt_regs *regs) > { > - unsigned long ip = instruction_pointer(regs); > + unsigned long ip = regs ? instruction_pointer(regs) : 0; > struct task_struct *t = current; > struct rseq_cs rseq_cs; > int ret; > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) > * If not nested over a rseq critical section, restart is useless. > * Clear the rseq_cs pointer and return. > */ > - if (!in_rseq_cs(ip, &rseq_cs)) > + if (!regs || !in_rseq_cs(ip, &rseq_cs)) I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work. If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns. However, clearing the thread's rseq_cs will prevent this from happening. So I would favor an approach where we simply do: if (!regs) return 0; Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in a ioctl system call. Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section. Thanks for looking into this ! Mathieu > return clear_rseq_cs(t); > ret = rseq_need_restart(t, rseq_cs.flags); > if (ret <= 0) > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com