On 27/09/2017 00:49, Marcelo Tosatti wrote: > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote: >> On 25/09/2017 11:13, Peter Zijlstra wrote: >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: >>>> I think you are missing the following point: >>>> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, >>>> otherwise it can't." >> >> Who says that? Certainly a driver can dedicate a single VCPU to >> periodic polling of the device, in such a way that the polling does not >> require a spinlock. > > This sequence: > > > VCPU-0 VCPU-1 (running realtime workload) > > takes spinlock A > scheduled out > spinlock(A) (busy spins until > VCPU-0 is scheduled > back in) > scheduled in > finishes execution of > code under protected section > releases spinlock(A) > > takes spinlock(A) Yes, but then you have busy waits for flag to be set polls device scheduled out device receives data ... scheduled in set flag or check for work to do scheduled out submit work busy waits for work to complete scheduled in do work None of which have anything to do with spinlocks. They're just different ways to get priority inversion, and this... >>>> So this point of "everything needs to be RT and the priorities must be >>>> designed carefully", is this: >>>> >>>> WHEN in spinlock protected section (more specifically, when >>>> spinlock protected section _shared with realtime vcpus_), >>>> >>>> priority of vcpu0 > priority of emulator thread >>>> >>>> OTHERWISE >>>> >>>> priority of vcpu0 < priority of emulator thread. >> >> This is _not_ designed carefully, this is messy. > > This is very precise to me. What is "messy" about it? (its clearly > defined). ... it's not how you design RT systems to avoid priority inversion. It's just solving an instance of the issue. >> The emulator thread can interrupt the VCPU thread, so it has to be at >> higher RT priority (+ priority inheritance of mutexes). > > It can only do that _when_ the VCPU thread is not running a critical > section which a higher priority task depends on. All VCPUs must have the same priority. There's no such thing as a housekeeping VCPU. There could be one sacrificial VCPU that you place on the same physical CPU as the emulator thread, but that's it. The whole system must run smoothly even if you place those on the same physical CPU, without any PV hacks. > So when you say "The emulator thread can interrupt the VCPU thread", > you're saying that it has to be modified to interrupt for a maximum > amount of time (say 15us). > > Is that what you are suggesting? This is correct. But I think you are missing a fundamental point, that is not specific to virt. If a thread 1 can trigger an event in thread 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N. In this case, the emulator thread can signal I/O completion to the VCPU: it doesn't matter if the VCPU is polling or using interrupts, the emulator thread must be placed at higher priority than the VCPU. This is the root cause of the SeaBIOS issue that you were seeing. Yes, it was me who suggested moving VCPU0 and the emulator thread to SCHED_NORMAL, but that was just a bandaid until the real fix was done (which is to set up SCHED_FIFO priorities correctly and use PI mutexes). In fact, this is not specific to the emulator thread. It applies just as well to vhost threads, to QEMU iothreads, even to the KVM PIT emulation thread if the guest were using it. > Paolo, you don't control how many interruptions of the emulator thread > happen per second. Indeed these non-VCPU threads should indeed run rarely and for a small amount of time only to achieve bounded latencies in the VCPUs. But if they don't, those are simply bugs and we fix them. In fact, sources of frequent interruptions have all been fixed or moved outside the main thread; for example, disks can use separate iothreads. Configuration problems are also bugs, just in a different place. For a very basic VM that I've just tried (whatever QEMU does by default, only tweak was not using the QEMU GUI), there is exactly one interruption per second when the VM is idle. That interruption in turn is caused by udev periodically probing the guest CDROM (!). So that's a configuration problem: remove the CDROM, and it does one interruption every 10-30 seconds, again all of them guest triggered. Again: if you have many interruptions, it's not a flaw in KVM or QEMU's design, it's just that someone is doing something stupid. It could be the guest (e.g. unnecessary devices or daemons as in the example above), QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second just to increment the clock), or the management (e.g. polling "is the VM running" 50 times per second). But it can and must be fixed. The design should be the basic one: attribute the right priority to each thread and things just work. Paolo > So if you let the emulator thread interrupt the > emulator thread at all times, without some kind of bounding > of these interruptions per time unit, you have a similar > problem as (*) (where the realtime task is scheduled). >