Re: Timer Signals vs KVM

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 1 Apr 2024 15:22:31 -0700

On Wed, Mar 27, 2024, Julian Stecklina wrote:
> Hey everyone,
> 
> we are developing the KVM backend for VirtualBox [0] and wanted to reach out
> regarding some weird behavior.
> 
> We are using `timer_create` to deliver timer events to vCPU threads as signals.
> We mask the signal using pthread_sigmask in the host vCPU thread and unmask them
> for guest execution using KVM_SET_SIGNAL_MASK.

What exactly do you mean by "timer events"?  From the split-lock blog post, it
does NOT seem like you're emulating guest timer events.  Specifically, this

  Consider that we want to run a KVM vCPU on Linux, but we want it to
  unconditionally exit after 1ms regardless of what the guest does.

sounds like you're doing vCPU scheduling in userspace.  But the above

  as opposed to using a separate thread that handles timers

doesn't really mesh with that.
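For reference, here is a minimal sketch of the setup as I read the quoted description: a timer_create() timer whose signal targets one thread, blocked there with pthread_sigmask() and unblocked for guest execution via KVM_SET_SIGNAL_MASK. The signal number, CLOCK_MONOTONIC, the helper names and the 8-byte kernel sigset length (x86-64) are all assumptions of mine, not details from the mail or from VirtualBox.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kvm.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid
#endif

/* Hypothetical choice; the mail does not say which signal is used. */
#define VCPU_TIMER_SIG SIGRTMIN

/* Block the timer signal in the calling (vCPU) thread. */
static int block_timer_signal(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, VCPU_TIMER_SIG);
    return pthread_sigmask(SIG_BLOCK, &set, NULL);
}

/* Create a timer whose expiry signal is directed at the calling thread.
 * CLOCK_MONOTONIC is an assumption; it keeps counting while the thread
 * is scheduled out. */
static int create_thread_timer(timer_t *out)
{
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD_ID;
    sev.sigev_signo = VCPU_TIMER_SIG;
    sev.sigev_notify_thread_id = (pid_t)syscall(SYS_gettid);
    return timer_create(CLOCK_MONOTONIC, &sev, out);
}

/* Arm the timer as a one-shot that fires ns nanoseconds from now. */
static int arm_timer_ns(timer_t t, long ns)
{
    struct itimerspec its;
    assert(ns >= 0);
    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = ns / 1000000000L;
    its.it_value.tv_nsec = ns % 1000000000L;
    return timer_settime(t, 0, &its, NULL);
}

/* Tell KVM to use a mask with the timer signal unblocked during KVM_RUN,
 * so an expiry kicks the vCPU out of the guest with EINTR. */
static int unblock_in_guest(int vcpu_fd)
{
    sigset_t set;
    struct kvm_signal_mask *m;
    const size_t klen = 8; /* kernel sigset size on x86-64 (assumption) */
    int r;

    /* Start from the thread's current mask and remove the timer signal. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &set) != 0)
        return -1;
    sigdelset(&set, VCPU_TIMER_SIG);

    m = calloc(1, sizeof(*m) + klen);
    if (!m)
        return -1;
    m->len = klen;
    memcpy(m->sigset, &set, klen);
    r = ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, m);
    free(m);
    return r;
}
```

A vCPU thread would call block_timer_signal(), create_thread_timer() and unblock_in_guest() once, then arm_timer_ns() before each KVM_RUN. On older glibc this may need -lrt and -lpthread at link time.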

> This method of handling timers works well and gives us very low latency as
> opposed to using a separate thread that handles timers. As far as we can tell,
> neither Qemu nor other VMMs use such a setup. We see two issues:
> 
> When we enable nested virtualization, we see what looks like corruption in the
> nested guest. The guest trips over exceptions that shouldn't be there. We are
> currently debugging this to find out details, but the setup is pretty painful
> and it will take a bit. If we disable the timer signals, this issue goes away
> (at the cost of broken VBox timers obviously...).  This is weird and has left us
> wondering, whether there might be something broken with signals in this
> scenario, especially since none of the other VMMs uses this method.

It's certainly possible there's a kernel bug, but it's probably more likely a
problem in your userspace.  QEMU (and other VMMs) do use signals to interrupt
vCPUs, e.g. to take control for live migration.  That's obviously different than
what you're doing, and will have orders of magnitude lower volume of signals in
nested guests, but the effective coverage isn't "zero".

> The other issue is that we have a somewhat sad interaction with split-lock

LOL, I think the "sad" part is redundant.  I've yet to have any interaction with
split-lock detection that wasn't sad. :-)

> detection, which I've blogged about some time ago [1]. Long story short: When
> you program timers <10ms into the future, you run the risk of making no progress
> anymore when the guest triggers the split-lock punishment [2]. See the blog post
> for details. I was wondering whether there is a better solution here than
> disabling the split-lock detection or whether our approach here is fundamentally
> broken.

I'm pretty sure disabling split-lock is just whacking one mole; there will be many
more lurking.  AIUI, timer_create() provides a per-process timer, i.e. a timer
which counts even if a task (i.e. a vCPU) is scheduled out.  The split-lock issue
is the most blatant problem because it's (a) 100% deterministic and (b) tied to
guest code.  But any other paths that might_sleep() are going to be problematic,
albeit far less likely to completely block forward progress.
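To make the "counts even if scheduled out" point concrete, here is a rough sketch that sleeps for a while (standing in for a vCPU that is blocked or scheduled out) and compares how far a wall-clock style clock and the thread's CPU-time clock advance. The helper names are mine, and a plain nanosleep() is only a stand-in for the split-lock sleep, not a reproduction of it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Read a clock as nanoseconds. */
static int64_t now_ns(clockid_t clk)
{
    struct timespec ts;
    int r = clock_gettime(clk, &ts);
    assert(r == 0);
    (void)r;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

struct elapsed {
    int64_t wall_ns; /* advance of CLOCK_MONOTONIC */
    int64_t cpu_ns;  /* advance of CLOCK_THREAD_CPUTIME_ID */
};

/* Sleep for sleep_ms and report how much each clock moved meanwhile. */
static struct elapsed measure_sleep(long sleep_ms)
{
    struct elapsed e;
    struct timespec req = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
    int64_t w0 = now_ns(CLOCK_MONOTONIC);
    int64_t c0 = now_ns(CLOCK_THREAD_CPUTIME_ID);

    /* Retry with the remaining time if a signal interrupts the sleep. */
    while (nanosleep(&req, &req) != 0 && errno == EINTR)
        ;

    e.wall_ns = now_ns(CLOCK_MONOTONIC) - w0;
    e.cpu_ns = now_ns(CLOCK_THREAD_CPUTIME_ID) - c0;
    return e;
}
```

A timer armed against the first clock keeps running down during the sleep, so a short timeout can expire before the thread has executed anything at all; the second clock barely moves.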

I don't really see a sane way around that, short of actually having a userspace
component that knows how long a task/vCPU has actually run.
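As a purely illustrative sketch of accounting for time actually run, the timer below is armed against the calling thread's CPU-time clock, so its budget is only consumed while the thread is on a CPU. Nobody in this thread has proposed this; the signal choice and helper names are hypothetical, and whether it would fit VirtualBox's latency needs is an open question.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid
#endif

/* Hypothetical signal used to report an exhausted run-time budget. */
#define BUDGET_SIG SIGRTMIN

/* CPU time consumed so far by the calling thread, in nanoseconds. */
static int64_t thread_cpu_ns(void)
{
    struct timespec ts;
    int r = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(r == 0);
    (void)r;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Create a timer that counts the calling thread's CPU time and signals
 * that same thread on expiry. */
static int budget_timer_create(timer_t *out)
{
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD_ID;
    sev.sigev_signo = BUDGET_SIG;
    sev.sigev_notify_thread_id = (pid_t)syscall(SYS_gettid);
    return timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, out);
}

/* Arm a one-shot budget of ns nanoseconds of CPU time. */
static int budget_timer_arm(timer_t t, int64_t ns)
{
    struct itimerspec its;
    assert(ns >= 0);
    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = (time_t)(ns / 1000000000LL);
    its.it_value.tv_nsec = (long)(ns % 1000000000LL);
    return timer_settime(t, 0, &its, NULL);
}

/* Spin until the calling thread has consumed ns more CPU time. */
static void burn_cpu_ns(int64_t ns)
{
    int64_t start = thread_cpu_ns();
    while (thread_cpu_ns() - start < ns)
        ;
}

/* Return 1 if the budget signal is pending, 0 if not, -1 on error.
 * The caller is expected to keep BUDGET_SIG blocked. */
static int budget_signal_pending(void)
{
    sigset_t pend;
    if (sigpending(&pend) != 0)
        return -1;
    return sigismember(&pend, BUDGET_SIG);
}
```

With BUDGET_SIG blocked and a 20ms budget armed, sleeping for 100ms should leave the signal not pending, while spinning for the same amount of CPU time should make it pending. Expiry of CPU-time timers is checked at scheduler-tick granularity, so sub-tick precision should not be assumed.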