On Mon, Jul 01, 2024 at 02:48:07PM +0200, Michal Prívozník wrote:
> On 6/30/24 20:49, Mikulas Patocka wrote:
> >
> >
> > On Sun, 30 Jun 2024, Tejun Heo wrote:
> >
> >> Hello,
> >>
> >> On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
> >>
> >>> With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
> >>>
> >>> The reason is that virt-manager by default sets up a topology where we
> >>> have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue
> >>> patch avoids moving work items across sockets, so it processes all
> >>> encryption work only on one virtual CPU.
> >>>
> >>> The performance degradation may be fixed with "echo 'system'
> >>> > /sys/module/workqueue/parameters/default_affinity_scope" - but it is
> >>> regression anyway, as many users don't know about this option.
> >>>
> >>> How should we fix it? There are several options:
> >>> 1. revert back to 'numa' affinity
> >>> 2. revert to 'numa' affinity only if we are in a virtual machine
> >>> 3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
> >>> 4. any other solution?
> >>
> >> Do you happen to know why libvirt is doing that? There are many other
> >> implications to configuring the system that way and I don't think we want to
> >> design kernel behaviors to suit topology information fed to VMs which can be
> >> arbitrary.
>
> Firstly, libvirt's not doing anything. It very specifically avoids doing
> policy decisions. If something configures vCPUs so that they are in
> separate sockets, then we should look at that something. Alternatively,
> if "default" configuration does not work for your workflow well,
> document recommended configuration.

Actually in this particular case, it is strictly speaking libvirt.
If the guest XML config does not mention any <topology> info, then
libvirt explicitly tells QEMU to set sockets=N,cores=1,threads=1.
That matches QEMU's own historical built-in default topology.
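
To make that default concrete, here is a rough sketch of how a vCPU
count could map onto a QEMU -smp value. The helper smp_argument() is
hypothetical and written only for illustration; it is not libvirt or
QEMU code. The "cores" layout is included purely for contrast with the
one-core-per-socket default described above.

```python
# Hypothetical helper for illustration only; not part of libvirt or QEMU.
def smp_argument(vcpus: int, layout: str = "sockets") -> str:
    """Return a QEMU -smp value for `vcpus` guest CPUs.

    layout="sockets": one core per socket (the default described above).
    layout="cores":   a single socket holding all cores, for contrast.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be >= 1")
    if layout == "sockets":
        sockets, cores = vcpus, 1
    elif layout == "cores":
        sockets, cores = 1, vcpus
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    # threads stays at 1 in both layouts.
    return f"{vcpus},sockets={sockets},cores={cores},threads=1"


print(smp_argument(16))           # 16,sockets=16,cores=1,threads=1
print(smp_argument(16, "cores"))  # 16,sockets=1,cores=16,threads=1
```

With 16 vCPUs the first call reproduces the 16-socket guest from the
report quoted above.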

Nonetheless, my advice for mgmt applications using libvirt would
likely be to explicitly request sockets=1,cores=N,threads=1. This
gives slightly better compatibility with unpleasant software that
applies licensing / subscription rules that penalize use of many
sockets, while being happy with any number of cores.

Either way though, the topology is a lie when the guest CPUs are
not pinned to host CPUs, so making performance decisions based on
this is unlikely to yield the desired results.

Historically the cores vs sockets distinction hasn't seemed to make
much difference to guest OS performance, as guest OSes haven't made
significant decisions on this axis. Exposing threads != 1, though,
has always been a big no unless strictly pinning 1:1 guest:host
CPUs, as that has had notable impacts on scheduling decisions.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|