> On Jun 21, 2023, at 5:28 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
>> lock_stat reports that the pool->lock kernel/workqueue.c:1483 is the highest
>> contended lock on my test NFS client. The issue appears to be that the three
>> NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod, all
>> get placed in the same worker_pool, so they have to fight over one pool lock.
>>
>> I notice that ib_comp_wq is allocated with the same flags, but I don't see
>> significant contention there, and a trace_printk in __queue_work shows that
>> work items queued on that WQ seem to alternate between at least two different
>> worker_pools.
>>
>> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
>> amongst the worker_pools?
>
> Can you share the output of lstopo on the test machine?

Machine (P#0 total=32480548KB DMIProductName="Super Server" DMIProductVersion=0123456789 DMIBoardVendor=Supermicro DMIBoardName=X12SPL-F DMIBoardVersion=2.00 DMIBoardAssetTag="Base Board Asset Tag" DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=0123456789 DMIChassisAssetTag="Chassis Asset Tag" DMIBIOSVendor="American Megatrends International, LLC." DMIBIOSVersion=1.1a DMIBIOSDate=08/05/2021 DMISysVendor=Supermicro Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=6.4.0-rc7-00005-ga0c30c01f971 OSVersion="#8 SMP PREEMPT Wed Jun 21 11:29:02 EDT 2023" HostName=morisot.XXXXXXXXXXX.net Architecture=x86_64 hwlocVersion=2.5.0 ProcessName=lstopo)
  Package L#0 (P#0 total=32480548KB CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=106 CPUModel="Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz" CPUStepping=6)
    NUMANode L#0 (P#0 local=32480548KB total=32480548KB)
    L3Cache L#0 (size=18432KB linesize=64 ways=12 Inclusive=0)
      L2Cache L#0 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#0 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#0 (P#0)
              PU L#0 (P#0)
      L2Cache L#1 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#1 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#1 (P#1)
              PU L#1 (P#1)
      L2Cache L#2 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#2 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#2 (P#2)
              PU L#2 (P#2)
      L2Cache L#3 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#3 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#3 (P#3)
              PU L#3 (P#3)
      L2Cache L#4 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#4 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#4 (P#4)
              PU L#4 (P#4)
      L2Cache L#5 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#5 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#5 (P#5)
              PU L#5 (P#5)
      L2Cache L#6 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#6 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#6 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#6 (P#6)
              PU L#6 (P#6)
      L2Cache L#7 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#7 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#7 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#7 (P#7)
              PU L#7 (P#7)
      L2Cache L#8 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#8 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#8 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#8 (P#8)
              PU L#8 (P#8)
      L2Cache L#9 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#9 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#9 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#9 (P#9)
              PU L#9 (P#9)
      L2Cache L#10 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#10 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#10 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#10 (P#10)
              PU L#10 (P#10)
      L2Cache L#11 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#11 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#11 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#11 (P#11)
              PU L#11 (P#11)
  HostBridge L#0 (buses=0000:[00-07])
    PCI L#0 (busid=0000:00:11.5 id=8086:a1d2 class=0106(SATA))
      Block(Removable Media Device) L#0 (Size=1048575 SectorSize=512 LinuxDeviceID=11:0 Model=ASUS_DRW-24F1ST_b Revision=1.00 SerialNumber=E5D0CL034213) "sr0"
    PCI L#1 (busid=0000:00:17.0 id=8086:a182 class=0106(SATA))
    PCIBridge L#1 (busid=0000:00:1c.0 id=8086:a190 class=0604(PCIBridge) link=0.25GB/s buses=0000:[01-01])
      PCI L#2 (busid=0000:01:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#1 (Address=3c:ec:ef:7a:0b:fa) "eno1"
    PCIBridge L#2 (busid=0000:00:1c.1 id=8086:a191 class=0604(PCIBridge) link=0.25GB/s buses=0000:[02-02])
      PCI L#3 (busid=0000:02:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#2 (Address=3c:ec:ef:7a:0b:fb) "eno2"
    PCIBridge L#3 (busid=0000:00:1c.5 id=8086:a195 class=0604(PCIBridge) link=0.62GB/s buses=0000:[05-06])
      PCIBridge L#4 (busid=0000:05:00.0 id=1a03:1150 class=0604(PCIBridge) link=0.62GB/s buses=0000:[06-06])
        PCI L#4 (busid=0000:06:00.0 id=1a03:2000 class=0300(VGA))
    PCIBridge L#5 (busid=0000:00:1d.0 id=8086:a198 class=0604(PCIBridge) link=3.94GB/s buses=0000:[07-07])
      PCI L#5 (busid=0000:07:00.0 id=c0a9:540a class=0108(NVMExp) link=3.94GB/s)
        Block(Disk) L#3 (Size=244198584 SectorSize=512 LinuxDeviceID=259:0 Model=CT250P2SSD8 Revision=P2CR012 SerialNumber=2116E597CC4F) "nvme0n1"
  HostBridge L#6 (buses=0000:[17-18])
    PCIBridge L#7 (busid=0000:17:02.0 id=8086:347a class=0604(PCIBridge) link=15.75GB/s buses=0000:[18-18])
      PCI L#6 (busid=0000:18:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=6)
        Network L#4 (Address=ec:0d:9a:92:b2:46 Port=1) "ens6np0"
        OpenFabrics L#5 (NodeGUID=ec0d:9a03:0092:b246 SysImageGUID=ec0d:9a03:0092:b246 Port1State=4 Port1LID=0x0 Port1LMC=0 Port1GID0=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID1=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID2=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID3=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID4=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID5=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID6=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID7=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID8=fe80:0000:0000:0000:88dd:0692:352e:0cec Port1GID9=fe80:0000:0000:0000:88dd:0692:352e:0cec) "rocep24s0"
  HostBridge L#8 (buses=0000:[50-51])
    PCIBridge L#9 (busid=0000:50:04.0 id=8086:347c class=0604(PCIBridge) link=15.75GB/s buses=0000:[51-51])
      PCI L#7 (busid=0000:51:00.0 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#6 (Address=00:00:05:f4:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0a Port=1) "ibs4f0"
        OpenFabrics L#7 (NodeGUID=b8ce:f603:0037:7a0a SysImageGUID=b8ce:f603:0037:7a0a Port1State=4 Port1LID=0xc Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0a) "ibp81s0f0"
      PCI L#8 (busid=0000:51:00.1 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#8 (Address=00:00:03:d3:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0b Port=1) "ibs4f1"
        OpenFabrics L#9 (NodeGUID=b8ce:f603:0037:7a0b SysImageGUID=b8ce:f603:0037:7a0a Port1State=1 Port1LID=0xffff Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0b) "ibp81s0f1"
depth 0: 1 Machine (type #0)
depth 1: 1 Package (type #1)
depth 2: 1 L3Cache (type #6)
depth 3: 12 L2Cache (type #5)
depth 4: 12 L1dCache (type #4)
depth 5: 12 L1iCache (type #9)
depth 6: 12 Core (type #2)
depth 7: 12 PU (type #3)
Special depth -3: 1 NUMANode (type #13)
Special depth -4: 10 Bridge (type #14)
Special depth -5: 9 PCIDev (type #15)
Special depth -6: 10 OSDev (type #16)
Memory attribute #2 name `Bandwidth' flags 5
  NUMANode L#0 = 1790 from cpuset 0x00000fff (Machine L#0)
Memory attribute #3 name `Latency' flags 6
  NUMANode L#0 = 7600 from cpuset 0x00000fff (Machine L#0)
CPU kind #0 efficiency 0 cpuset 0x00000fff
  FrequencyMaxMHz = 3300

> The following branch has pending workqueue changes which make unbound
> workqueues finer grained by default and a lot more flexible in how they're
> segmented.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
>
> Can you please test with the branch? If the default doesn't improve the
> situation, you can set WQ_SYSFS on the affected workqueues and change their
> scoping by writing to /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope.
> Please take a look at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350
>
> for more details.

I will give this a try.

--
Chuck Lever
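[For anyone following along: the suggested tuning above could be driven from userspace roughly as sketched below. This is an assumption-laden sysfs/config fragment, not tested code from the thread: "rpciod" is used only as an example workqueue name, and the NFS workqueues would first need WQ_SYSFS added to their alloc_workqueue() flags in the kernel before any attribute file appears under /sys/devices/virtual/workqueue/. The set of valid scope names comes from the affinity-scopes branch documentation linked above.]

```shell
#!/bin/sh
# Sketch: inspect and change the affinity scope of a WQ_SYSFS-visible
# unbound workqueue, per the affinity-scopes-v2 branch. "rpciod" is a
# hypothetical example; substitute the workqueue you actually exported.
wq=${1:-rpciod}
attr=/sys/devices/virtual/workqueue/$wq/affinity_scope

if [ -w "$attr" ]; then
    # Show the current scope (e.g. "cache", "numa", "cpu", "smt", "system").
    printf 'current scope: %s\n' "$(cat "$attr")"
    # Narrow the scope so work items stay on smaller pools; "cpu" is the
    # finest-grained choice and should split contention across more
    # worker_pools (and thus more pool->lock instances).
    echo cpu > "$attr"
else
    echo "workqueue '$wq' does not expose affinity_scope (WQ_SYSFS not set?)" >&2
fi
```

After changing the scope, re-running the lock_stat measurement would show whether pool->lock contention actually drops; the trade-off is that finer scoping reduces work-conservation across the machine.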