On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote: > On 27/09/2017 00:49, Marcelo Tosatti wrote: > > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote: > >> On 25/09/2017 11:13, Peter Zijlstra wrote: > >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: > >>>> I think you are missing the following point: > >>>> > >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, > >>>> otherwise it can't." > >> > >> Who says that? Certainly a driver can dedicate a single VCPU to > >> periodic polling of the device, in such a way that the polling does not > >> require a spinlock. > > > > This sequence: > > > > > > VCPU-0 VCPU-1 (running realtime workload) > > > > takes spinlock A > > scheduled out > > spinlock(A) (busy spins until > > VCPU-0 is scheduled > > back in) > > scheduled in > > finishes execution of > > code under protected section > > releases spinlock(A) > > > > takes spinlock(A) > > Yes, but then you have > > busy waits for flag to be set > polls device > scheduled out > device receives data > ... > scheduled in > set flag > > or > > check for work to do > scheduled out > submit work > busy waits for work to complete > scheduled in > do work > > None of which have anything to do with spinlocks. They're just > different ways to get priority inversion, and this... > > >>>> So this point of "everything needs to be RT and the priorities must be > >>>> designed carefully", is this: > >>>> > >>>> WHEN in spinlock protected section (more specifically, when > >>>> spinlock protected section _shared with realtime vcpus_), > >>>> > >>>> priority of vcpu0 > priority of emulator thread > >>>> > >>>> OTHERWISE > >>>> > >>>> priority of vcpu0 < priority of emulator thread. > >> > >> This is _not_ designed carefully, this is messy. > > > > This is very precise to me. What is "messy" about it? (its clearly > > defined). > > ... it's not how you design RT systems to avoid priority inversion. > It's just solving an instance of the issue. > > >> The emulator thread can interrupt the VCPU thread, so it has to be at > >> higher RT priority (+ priority inheritance of mutexes). > > > > It can only do that _when_ the VCPU thread is not running a critical > > section which a higher priority task depends on. > > All VCPUs must have the same priority. There's no such thing as a > housekeeping VCPU. > > There could be one sacrificial VCPU that you place on the same physical > CPU as the emulator thread, but that's it. The whole system must run > smoothly even if you place those on the same physical CPU, without any > PV hacks. > > > So when you say "The emulator thread can interrupt the VCPU thread", > > you're saying that it has to be modified to interrupt for a maximum > > amount of time (say 15us). > > > > Is that what you are suggesting? > > This is correct. But I think you are missing a fundamental point, that > is not specific to virt. If a thread 1 can trigger an event in thread > 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N. > > In this case, the emulator thread can signal I/O completion to the VCPU: > it doesn't matter if the VCPU is polling or using interrupts, the > emulator thread must be placed at higher priority than the VCPU. This > is the root cause of the SeaBIOS issue that you were seeing. Yes, it > was me who suggested moving VCPU0 and the emulator thread to > SCHED_NORMAL, but that was just a bandaid until the real fix was done > (which is to set up SCHED_FIFO priorities correctly and use PI mutexes). > > In fact, this is not specific to the emulator thread. It applies just > as well to vhost threads, to QEMU iothreads, even to the KVM PIT > emulation thread if the guest were using it. > > > Paolo, you don't control how many interruptions of the emulator thread > > happen per second. > > Indeed these non-VCPU threads should indeed run rarely and for a small > amount of time only to achieve bounded latencies in the VCPUs. But if > they don't, those are simply bugs and we fix them. In fact, sources of > frequent interruptions have all been fixed or moved outside the main > thread; for example, disks can use separate iothreads. Configuration > problems are also bugs, just in a different place. > > For a very basic VM that I've just tried (whatever QEMU does by default, > only tweak was not using the QEMU GUI), there is exactly one > interruption per second when the VM is idle. That interruption in turn > is caused by udev periodically probing the guest CDROM (!). So that's a > configuration problem: remove the CDROM, and it does one interruption > every 10-30 seconds, again all of them guest triggered. > > Again: if you have many interruptions, it's not a flaw in KVM or QEMU's > design, it's just that someone is doing something stupid. It could be > the guest (e.g. unnecessary devices or daemons as in the example above), > QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second > just to increment the clock), or the management (e.g. polling "is the VM > running" 50 times per second). But it can and must be fixed. No, i mean you can run anything in VCPU-0 (it is valid to do that). And that "anything" can generate 1 interrupt per second, 1000 or 10.000 interrupts per second. Which are all valid things to be done. "I can't run a kernel compilation on VCPU-0 because that will impact latency on the realtime VCPU-1" is not acceptable. > The design should be the basic one: attribute the right priority to each > thread and things just work. > > Paolo > > > So if you let the emulator thread interrupt the > > emulator thread at all times, without some kind of bounding > > of these interruptions per time unit, you have a similar > > problem as (*) (where the realtime task is scheduled). > >