Re: dm-crypt performance regression due to workqueue changes

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Mon, 1 Jul 2024 14:08:50 +0100

On Sun, Jun 30, 2024 at 08:49:48PM +0200, Mikulas Patocka wrote:
> 
> 
> On Sun, 30 Jun 2024, Tejun Heo wrote:
> 
> > Hello,
> > 
> > On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
> > 
> > > With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
> > > 
> > > The reason is that virt-manager by default sets up a topology where we 
> > > have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue 
> > > patch avoids moving work items across sockets, so it processes all 
> > > encryption work only on one virtual CPU.
> > >
> > > The performance degradation may be fixed with
> > >   echo 'system' >/sys/module/workqueue/parameters/default_affinity_scope
> > > but it is a regression anyway, as many users don't know about this option.
> > > 
> > > How should we fix it? There are several options:
> > > 1. revert back to 'numa' affinity
> > > 2. revert to 'numa' affinity only if we are in a virtual machine
> > > 3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
> > > 4. any other solution?
> > 
> > Do you happen to know why libvirt is doing that? There are many other
> > implications to configuring the system that way and I don't think we want to
> > design kernel behaviors to suit topology information fed to VMs which can be
> > arbitrary.
> > 
> > Thanks.
> 
> I don't know why. I added users@xxxxxxxxxxxxxxxxx to the CC.
> 
> How should libvirt properly advertise "we have 16 threads that are 
> dynamically scheduled by the host kernel, so the latencies between them 
> are changing and unpredictable"?

NB, libvirt is just the control plane; the actual virtual hardware exposed
is implemented across QEMU and the KVM kernel module. Guest CPU topology
and/or NUMA cost information is the responsibility of QEMU.
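
For anyone wanting to check what topology a guest was actually handed,
it can be read back from sysfs inside the guest. A minimal sketch,
assuming the usual Linux layout under /sys/devices/system/cpu with a
topology/physical_package_id file per CPU; the function names are
hypothetical and purely illustrative:

```python
from collections import Counter
from pathlib import Path


def cpus_per_socket(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Map each socket (physical package) id to the number of CPUs in it."""
    counts = Counter()
    # cpu[0-9]* skips sibling entries such as cpufreq/ and cpuidle/.
    pattern = "cpu[0-9]*/topology/physical_package_id"
    for pkg_file in Path(sysfs_cpu_root).glob(pattern):
        counts[int(pkg_file.read_text().strip())] += 1
    return dict(counts)


def looks_like_one_cpu_per_socket(counts):
    """True when there are several sockets and each holds exactly one CPU."""
    return len(counts) > 1 and all(n == 1 for n in counts.values())


if __name__ == "__main__":
    counts = cpus_per_socket()
    print(f"{len(counts)} socket(s): {counts}")
    if looks_like_one_cpu_per_socket(counts):
        print("every CPU is reported as its own socket")
```

A guest configured as "16 sockets, 1 core per socket, 1 thread per core"
would be expected to report sixteen package ids with one CPU each.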

When QEMU's virtual CPUs are floating freely across host CPUs there's
no perfect answer. The host admin needs to make a tradeoff in their
configuration:

They can optimize for density, by allowing guest CPUs to float freely
and allowing CPU overcommit against host CPUs, in which case the guest
CPU topology is essentially a lie.

They can optimize for predictable performance, by strictly pinning
guest CPUs 1:1 to host CPUs, minimizing CPU overcommit, and having
the guest CPU topology match the host CPU topology 1:1.
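
To make the pinning option a little more concrete: in libvirt domain
XML, per-vCPU pinning is expressed with <vcpupin> elements under
<cputune>. A minimal sketch that generates them, assuming the admin has
already picked one dedicated host CPU per vCPU; the helper name and the
host CPU numbers are hypothetical:

```python
import xml.etree.ElementTree as ET


def build_cputune(host_cpus):
    """Return a <cputune> element pinning vCPU i to host_cpus[i], one to one."""
    if len(set(host_cpus)) != len(host_cpus):
        raise ValueError("each vCPU needs its own host CPU for 1:1 pinning")
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return cputune


if __name__ == "__main__":
    # Hypothetical choice: pin 4 vCPUs to host CPUs 2-5.
    print(ET.tostring(build_cputune([2, 3, 4, 5]), encoding="unicode"))
```

This only covers the pinning half. The guest topology definition would
still need to be written to mirror the host's sockets, cores and
threads, which is not shown here.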

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|