Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Thu, 19 Aug 2021 17:39:12 -0400 (EDT)

----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
> transferring to a KVM guest, which is roughly equivalent to an exit to
> userspace and processes many of the same pending actions.  While the task
> cannot be in an rseq critical section as the KVM path is reachable only
> via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
> critical section still apply, e.g. the CPU ID needs to be updated if the
> task is migrated.
> 
> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
> and other badness in userspace VMMs that use rseq in combination with KVM,
> e.g. due to the CPU ID being stale after task migration.

I agree with the problem assessment, but I would recommend a small change
to this fix.

> 
> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
> Reported-by: Peter Foley <pefoley@xxxxxxxxxx>
> Bisected-by: Doug Evans <dje@xxxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> kernel/entry/kvm.c | 4 +++-
> kernel/rseq.c      | 4 ++--
> 2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 49972ee99aff..049fd06b4c3d 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
> 		if (ti_work & _TIF_NEED_RESCHED)
> 			schedule();
> 
> -		if (ti_work & _TIF_NOTIFY_RESUME)
> +		if (ti_work & _TIF_NOTIFY_RESUME) {
> 			tracehook_notify_resume(NULL);
> +			rseq_handle_notify_resume(NULL, NULL);
> +		}
> 
> 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
> 		if (ret)
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..58c79a7918cd 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs
> *rseq_cs)
> 
> static int rseq_ip_fixup(struct pt_regs *regs)
> {
> -	unsigned long ip = instruction_pointer(regs);
> +	unsigned long ip = regs ? instruction_pointer(regs) : 0;
> 	struct task_struct *t = current;
> 	struct rseq_cs rseq_cs;
> 	int ret;
> @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> 	 * If not nested over a rseq critical section, restart is useless.
> 	 * Clear the rseq_cs pointer and return.
> 	 */
> -	if (!in_rseq_cs(ip, &rseq_cs))
> +	if (!regs || !in_rseq_cs(ip, &rseq_cs))

I think clearing the thread's rseq_cs unconditionally here when regs is NULL
is not the behavior we want when this is called from xfer_to_guest_mode_work.

If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
kill this application in the rseq_syscall handler when exiting back to usermode
when the ioctl eventually returns.

However, clearing the thread's rseq_cs will prevent this from happening.

So I would favor an approach where we simply do:

if (!regs)
     return 0;

Immediately at the beginning of rseq_ip_fixup, before getting the instruction
pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
is not relevant to do any fixup here, because it is nested in a ioctl system
call.

Effectively, this would preserve the SIGSEGV behavior when this ioctl is
erroneously called by user-space from a rseq critical section.

Thanks for looking into this !

Mathieu

> 		return clear_rseq_cs(t);
> 	ret = rseq_need_restart(t, rseq_cs.flags);
> 	if (ret <= 0)
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com