On Tue, Jul 26, 2022, Andrei Vagin wrote:
> On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > +x86 maintainers, patch 1 most definitely needs acceptance from folks
> > beyond KVM.
> >
> > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > kernel) can run in a guest ring0 and create/manage multiple address
> > > spaces. Its performance is much better than the ptrace one, but it is
> > > still not great compared with the native performance. This change
> > > optimizes the most critical part, which is the syscall overhead.
> >
> > What exactly is the source of the syscall overhead,
>
> Here are perf traces for two cases: when "guest" syscalls are executed via
> hypercalls and when syscalls are executed by the user-space VMM:
> https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
>
> And here are two tests that I use to collect these traces:
> https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
>
> If we compare these traces, we can find that in the second case, we spend
> extra time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> syscall_exit_to_user_mode.

So of those, I think the only path a robust implementation can actually avoid,
without significantly whittling down the allowed set of syscalls, is
syscall_exit_to_user_mode().

The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
through that before calling out of KVM.  E.g. arch_prctl(ARCH_GET_GS) will
read the wrong GS.base if MSR_KERNEL_GS_BASE isn't restored (see the userspace
sketch at [1] below).  And that necessitates calling
vmx_prepare_switch_to_guest() when resuming the vCPU.

FPU state, i.e. fpu_swap_kvm_fpstate(), is likely a similar story: there's
bound to be a syscall that accesses user FPU state and will do the wrong thing
if guest state is loaded.

For gVisor, that's all presumably a non-issue because it uses a small set of
syscalls (or has guest==host state?), but for a common KVM feature it's
problematic.

> > and what alternatives have been explored? Making arbitrary syscalls from
> > within KVM is mildly terrifying.
>
> "mildly terrifying" is a good sentence in this case:). If I were in your
> place, I would think about it similarly.
>
> I understand these concerns about calling syscalls from the KVM code, and
> this is why I hide this feature under a separate capability that can be
> enabled explicitly.
>
> We can think about restricting the list of system calls that this hypercall
> can execute. In the user-space changes for gVisor, we have a list of system
> calls that are not executed via this hypercall.

Can you provide that list?

> But it has downsides:
>   * Each sentry system call trigger the full exit to hr3.
>   * Each vmenter/vmexit requires to trigger a signal but it is expensive.

Can you explain this one?  I didn't quite follow what this is referring to.

>   * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The
>     Sentry has to be fully enclosed in a VM to be able to support these
>     technologies.

Speaking of SGX, this reminds me a lot of Graphene, SCONE, etc., which IIRC
tackled the "syscalls are crazy expensive" problem by using a message queue
and a dedicated task outside of the enclave to handle syscalls (rough sketch
at [2] below).

Would something like that work, or is having to burn a pCPU (or more) to
handle syscalls in the host a non-starter?
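
[1] A minimal userspace sketch of the GS.base concern, assuming nothing beyond
the stock arch_prctl(2) uAPI (the magic value is obviously made up).  The task
sets its GS.base and relies on reading the same value back; if ARCH_GET_GS
were serviced while guest state is still loaded, i.e. before
MSR_KERNEL_GS_BASE is restored, it would observe the guest's value instead.

#define _GNU_SOURCE
#include <asm/prctl.h>          /* ARCH_SET_GS, ARCH_GET_GS */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no arch_prctl() wrapper, go through syscall(2). */
static long xarch_prctl(int op, unsigned long addr)
{
	return syscall(SYS_arch_prctl, op, addr);
}

int main(void)
{
	unsigned long gsbase = 0;

	/* The task sets its own GS.base (arbitrary, never dereferenced)... */
	if (xarch_prctl(ARCH_SET_GS, 0xdead0000ul))
		return 1;

	/*
	 * ...and expects to read the same value back.  That only holds if
	 * the host's MSR_KERNEL_GS_BASE has been restored before this
	 * syscall runs, which is what vmx_prepare_switch_to_host() does.
	 */
	if (xarch_prctl(ARCH_GET_GS, (unsigned long)&gsbase))
		return 1;

	printf("GS.base = 0x%lx\n", gsbase);
	return 0;
}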
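
[2] Rough sketch of the message queue idea, in case it helps frame the
discussion.  Everything here is made up (slot layout, names, polling policy)
and nothing is gVisor-specific: the Sentry writes a request into a
shared-memory slot, and a dedicated host thread (the burned pCPU) picks it up
and issues the actual syscall, so the guest never exits for the syscall
itself.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { SLOT_FREE, SLOT_SUBMITTED, SLOT_DONE };

struct syscall_req {
	atomic_int	state;
	long		nr;
	long		args[6];
	long		ret;
};

#define RING_SIZE 64
struct syscall_ring {
	struct syscall_req slots[RING_SIZE];
};

/* Host side: a dedicated thread polls the ring shared with the guest. */
static void *syscall_handler(void *arg)
{
	struct syscall_ring *ring = arg;

	for (;;) {
		for (int i = 0; i < RING_SIZE; i++) {
			struct syscall_req *req = &ring->slots[i];

			if (atomic_load_explicit(&req->state,
						 memory_order_acquire) != SLOT_SUBMITTED)
				continue;

			/* Issue the syscall on the guest's behalf, in host context. */
			req->ret = syscall(req->nr, req->args[0], req->args[1],
					   req->args[2], req->args[3],
					   req->args[4], req->args[5]);
			atomic_store_explicit(&req->state, SLOT_DONE,
					      memory_order_release);
		}
	}
	return NULL;
}

/* Guest/Sentry side: submit a request and wait for the host to complete it. */
static long remote_syscall(struct syscall_req *req, long nr, long a0, long a1,
			   long a2, long a3, long a4, long a5)
{
	req->nr = nr;
	req->args[0] = a0; req->args[1] = a1; req->args[2] = a2;
	req->args[3] = a3; req->args[4] = a4; req->args[5] = a5;
	atomic_store_explicit(&req->state, SLOT_SUBMITTED, memory_order_release);

	/* Busy-wait; a doorbell/futex-style wakeup would be less wasteful. */
	while (atomic_load_explicit(&req->state, memory_order_acquire) != SLOT_DONE)
		__builtin_ia32_pause();

	atomic_store_explicit(&req->state, SLOT_FREE, memory_order_relaxed);
	return req->ret;
}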