On Tue, Jul 26, 2022 at 03:10:34PM +0000, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Andrei Vagin wrote:
> > On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> > >
> > > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > > kernel) can run in a guest ring0 and create/manage multiple address
> > > > spaces. Its performance is much better than the ptrace one, but it is
> > > > still not great compared with the native performance. This change
> > > > optimizes the most critical part, which is the syscall overhead.
> > >
> > > What exactly is the source of the syscall overhead,
> >
> > Here are perf traces for two cases: when "guest" syscalls are executed via
> > hypercalls and when syscalls are executed by the user-space VMM:
> > https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
> >
> > And here are two tests that I use to collect these traces:
> > https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
> >
> > If we compare these traces, we can find that in the second case, we spend extra
> > time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> > syscall_exit_to_user_mode.
>
> So of those, I think the only path a robust implementation can actually avoid,
> without significantly whittling down the allowed set of syscalls, is
> syscall_exit_to_user_mode().
>
> The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
> through that before calling out of KVM. E.g. prctrl(ARCH_GET_GS) will read the
> wrong GS.base if MSR_KERNEL_GS_BASE isn't restored. And that necessitates
> calling vmx_prepare_switch_to_guest() when resuming the vCPU.
>
> FPU state, i.e. fpu_swap_kvm_fpstate() is likely a similar story, there's bound
> to be a syscall that accesses user FPU state and will do the wrong thing if guest
> state is loaded.
>
> For gVisor, that's all presumably a non-issue because it uses a small set of
> syscalls (or has guest==host state?), but for a common KVM feature it's
> problematic.

I think the number of system calls that touch state shared with KVM is very
limited, and we can blocklist all of them. Another option is to have an
allowlist of system calls, to be sure that we don't miss anything.

> > > and what alternatives have been explored? Making arbitrary syscalls from
> > > within KVM is mildly terrifying.
> >
> > "mildly terrifying" is a good sentence in this case:). If I were in your place,
> > I would think about it similarly.
> >
> > I understand these concerns about calling syscalls from the KVM code, and this
> > is why I hide this feature under a separate capability that can be enabled
> > explicitly.
> >
> > We can think about restricting the list of system calls that this hypercall can
> > execute. In the user-space changes for gVisor, we have a list of system calls
> > that are not executed via this hypercall.
>
> Can you provide that list?

Here is the list of system calls that are not executed via this hypercall:
clone, exit, exit_group, ioctl, rt_sigreturn, mmap, arch_prctl, sigprocmask.

And here is the list of all system calls that we allow for the Sentry:
clock_gettime, close, dup, dup3, epoll_create1, epoll_ctl, epoll_pwait,
eventfd2, exit, exit_group, fallocate, fchmod, fcntl, fstat, fsync, ftruncate,
futex, getcpu, getpid, getrandom, getsockopt, gettid, gettimeofday, ioctl,
lseek, madvise, membarrier, mincore, mmap, mprotect, munmap, nanosleep, ppoll,
pread64, preadv, preadv2, pwrite64, pwritev, pwritev2, read, recvmsg, recvmmsg,
sendmsg, sendmmsg, restart_syscall, rt_sigaction, rt_sigprocmask, rt_sigreturn,
sched_yield, setitimer, shutdown, sigaltstack, statx, sync_file_range, tee,
timer_create, timer_delete, timer_settime, tgkill, utimensat, write, writev.
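
To make the filtering idea a bit more concrete, below is a minimal sketch of
what an allowlist check could look like. This is not code from the patch; the
function and table names are hypothetical, and the table only has a few
entries from the list above:

/*
 * Illustrative sketch only: a static allowlist consulted before a "guest"
 * syscall requested via the hypercall is dispatched. All names here are
 * made up; this is not the actual KVM or gVisor code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

static const long sentry_hc_allowlist[] = {
        SYS_read, SYS_write, SYS_futex, SYS_epoll_pwait, SYS_ppoll,
        SYS_recvmsg, SYS_sendmsg, SYS_tgkill,
        /* ... the rest of the allowlist above ... */
};

static bool sentry_hc_syscall_allowed(long nr)
{
        size_t i;

        for (i = 0; i < sizeof(sentry_hc_allowlist) / sizeof(sentry_hc_allowlist[0]); i++) {
                if (sentry_hc_allowlist[i] == nr)
                        return true;
        }
        return false;
}

/*
 * In the hypercall path, the check would run before dispatch, roughly:
 *
 *      if (!sentry_hc_syscall_allowed(nr))
 *              return -ENOSYS;
 */

The nice property of the allowlist direction is that anything touching state
KVM keeps loaded while the vCPU is in use (FPU state, segment bases, etc.)
fails closed unless it has been explicitly audited.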
>
> > But it has downsides:
> > * Each sentry system call trigger the full exit to hr3.
> > * Each vmenter/vmexit requires to trigger a signal but it is expensive.
>
> Can you explain this one? I didn't quite follow what this is referring to.

In my message, there was an explanation of how the gVisor KVM platform works
right now, and these are the two points about why it is slow. Each time the
Sentry triggers a system call, it has to switch to the host ring3. When the
Sentry wants to switch to the guest ring0, it triggers a signal to fall into a
signal handler. There, we have a sigcontext that we use to get the current
thread state in order to resume execution in gr0; then, when the Sentry needs
to switch back to hr3, we set the Sentry state from gr0 in the sigcontext and
return from the signal handler.

>
> > * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The Sentry
> > has to be fully enclosed in a VM to be able to support these technologies.
>
> Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> tackled the "syscalls are crazy expensive" problem by using a message queue and
> a dedicated task outside of the enclave to handle syscalls. Would something like
> that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> non-starter?

Context switching is expensive... There were a few attempts to implement
synchronous context switching ([1], [2]) that could help in this case, but
even with this sort of optimization, it is too expensive.

1. https://lwn.net/Articles/824409/
2. https://www.spinics.net/lists/linux-api/msg50417.html

Thanks,
Andrei
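
P.S. In case it helps to visualize the sigcontext trick described above, here
is a rough user-space illustration in C. The real platform code lives in
gVisor (Go plus assembly); the names and the way the two register states are
stored here are made up, and the actual guest entry/exit via KVM_RUN is
omitted:

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

/*
 * Hypothetical storage for the two register states we flip between: the
 * state the Sentry had in guest ring0 (gr0) and the interrupted host
 * ring3 (hr3) state.
 */
static gregset_t gr0_state;
static gregset_t hr3_state;

static void switch_handler(int sig, siginfo_t *info, void *ucv)
{
        ucontext_t *uc = ucv;

        /* Stash the interrupted hr3 register state... */
        memcpy(hr3_state, uc->uc_mcontext.gregs, sizeof(gregset_t));
        /*
         * ...and overwrite the sigcontext with the gr0 state, so that the
         * sigreturn at the end of this handler resumes execution there.
         */
        memcpy(uc->uc_mcontext.gregs, gr0_state, sizeof(gregset_t));
}

/*
 * The handler is installed with SA_SIGINFO, e.g.:
 *
 *      struct sigaction sa = { .sa_sigaction = switch_handler,
 *                              .sa_flags = SA_SIGINFO };
 *      sigaction(SIGUSR1, &sa, NULL);
 *
 * Every switch therefore costs a signal delivery plus a sigreturn, which is
 * the overhead that executing Sentry syscalls via the hypercall avoids.
 */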