Re: dm-crypt performance regression due to workqueue changes

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Mon, 1 Jul 2024 14:52:19 +0100

On Mon, Jul 01, 2024 at 03:42:29PM +0200, Mikulas Patocka wrote:
> 
> 
> On Mon, 1 Jul 2024, Daniel P. Berrangé wrote:
> 
> > On Sun, Jun 30, 2024 at 08:49:48PM +0200, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Sun, 30 Jun 2024, Tejun Heo wrote:
> > > 
> > > > Do you happen to know why libvirt is doing that? There are many other
> > > > implications to configuring the system that way and I don't think we want to
> > > > design kernel behaviors to suit topology information fed to VMs which can be
> > > > arbitrary.
> > > > 
> > > > Thanks.
> > > 
> > > I don't know why. I added users@xxxxxxxxxxxxxxxxx to the CC.
> > > 
> > > How should libvirt properly advertise "we have 16 threads that are 
> > > dynamically scheduled by the host kernel, so the latencies between them 
> > > are changing and unpredictable"?
> > 
> > NB, libvirt is just control plane, the actual virtual hardware exposed
> > is implemented across QEMU and the KVM kernel mod. Guest CPU topology
> > and/or NUMA cost information is the responsibility of QEMU.
> > 
> > When QEMU's virtual CPUs are floating freely across host CPUs there's
> > no perfect answer. The host admin needs to make a tradeoff in their
> > configuration
> > 
> > They can optimize for density, by allowing guest CPUs to float freely
> > and allow CPU overcommit against host CPUs, and the guest CPU topology
> > is essentially a lie.
> > 
> > They can optimize for predictable performance, by strictly pinning
> > guest CPUs 1:1 to host CPUs, and minimize CPU overcommit, and have
> > the guest CPU topology 1:1 match the host CPU topology.
> 
> The problem that we have here is that the commit 
> 63c5484e74952f60f5810256bd69814d167b8d22 ("workqueue: Add multiple 
> affinity scopes and interface to select them") changes the behavior of 
> unbound workqueues, so that work items are only executed on CPUs that 
> share last level cache with the task that submitted them.
> 
> If there are 16 virtual CPUs that are freely floating across physical 
> CPUs, virt-manager by default selects a topology where it advertises 16 
> sockets, 1 CPU per socket, 1 thread per CPU. The result is that the 
> unbound workqueues are no longer unbound, they can't move work across 
> sockets and they are bound to just one virtual CPU, causing dm-crypt 
> performance degradation. (the crypto operations are no longer 
> parallelized).
> 
> Whose bug is this? Is it a bug in virt-manager because it advertises 
> invalid topology? Is this a bug in that patch 63c5484e7495 because it 
> avoids moving work items across sockets?

It is hard to call it is a bug in anything. The Linux patch is reasonable
in honouring the CPU topology. The hypervisor is reasonable to exposing
sockets=N,cores=1 as there's no right answer for the topology choice, and
it has no idea how it may or may not impact guest OS behaviour or perf.
Letting CPUs float freely is sensible default behaviour too. All of them
conspire to have a perf impact here, but the deployment is not seeking
to maximise performance, rather to maximise flexibility & density.

None the less, I'd suggest that virt-manager should be a explicitly
asking for sockets=1,cores=N, as that has broader guest OS compatibility.
By chance that would also help this scenario, but that woudn't be a
driving factor as we can't pick defaults based on the needs of particular
versions of a particular guest kernel.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|