Re: contention on pwq->pool->lock under heavy NFS workload

> On Jun 21, 2023, at 5:28 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> 
> Hello,
> 
> On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
>> lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
>> heavily contended lock on my test NFS client. The issue appears to be that the
>> three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
>> all get placed in the same worker_pool, so they have to fight over one pool lock.
>> 
>> I notice that ib_comp_wq is allocated with the same flags, but I don't see
>> significant contention there, and a trace_printk in __queue_work shows that
>> work items queued on that WQ seem to alternate between at least two different
>> worker_pools.
>> 
>> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
>> amongst the worker_pools?
> 
> Can you share the output of lstopo on the test machine?

Machine (P#0 total=32480548KB DMIProductName="Super Server" DMIProductVersion=0123456789 DMIBoardVendor=Supermicro DMIBoardName=X12SPL-F DMIBoardVersion=2.00 DMIBoardAssetTag="Base Board Asset Tag" DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=0123456789 DMIChassisAssetTag="Chassis Asset Tag" DMIBIOSVendor="American Megatrends International, LLC." DMIBIOSVersion=1.1a DMIBIOSDate=08/05/2021 DMISysVendor=Supermicro Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=6.4.0-rc7-00005-ga0c30c01f971 OSVersion="#8 SMP PREEMPT Wed Jun 21 11:29:02 EDT 2023" HostName=morisot.XXXXXXXXXXX.net Architecture=x86_64 hwlocVersion=2.5.0 ProcessName=lstopo)
  Package L#0 (P#0 total=32480548KB CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=106 CPUModel="Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz" CPUStepping=6)
    NUMANode L#0 (P#0 local=32480548KB total=32480548KB)
    L3Cache L#0 (size=18432KB linesize=64 ways=12 Inclusive=0)
      L2Cache L#0 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#0 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#0 (P#0)
              PU L#0 (P#0)
      L2Cache L#1 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#1 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#1 (P#1)
              PU L#1 (P#1)
      L2Cache L#2 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#2 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#2 (P#2)
              PU L#2 (P#2)
      L2Cache L#3 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#3 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#3 (P#3)
              PU L#3 (P#3)
      L2Cache L#4 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#4 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#4 (P#4)
              PU L#4 (P#4)
      L2Cache L#5 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#5 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#5 (P#5)
              PU L#5 (P#5)
      L2Cache L#6 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#6 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#6 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#6 (P#6)
              PU L#6 (P#6)
      L2Cache L#7 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#7 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#7 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#7 (P#7)
              PU L#7 (P#7)
      L2Cache L#8 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#8 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#8 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#8 (P#8)
              PU L#8 (P#8)
      L2Cache L#9 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#9 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#9 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#9 (P#9)
              PU L#9 (P#9)
      L2Cache L#10 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#10 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#10 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#10 (P#10)
              PU L#10 (P#10)
      L2Cache L#11 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#11 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#11 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#11 (P#11)
              PU L#11 (P#11)
  HostBridge L#0 (buses=0000:[00-07])
    PCI L#0 (busid=0000:00:11.5 id=8086:a1d2 class=0106(SATA))
      Block(Removable Media Device) L#0 (Size=1048575 SectorSize=512 LinuxDeviceID=11:0 Model=ASUS_DRW-24F1ST_b Revision=1.00 SerialNumber=E5D0CL034213) "sr0"
    PCI L#1 (busid=0000:00:17.0 id=8086:a182 class=0106(SATA))
    PCIBridge L#1 (busid=0000:00:1c.0 id=8086:a190 class=0604(PCIBridge) link=0.25GB/s buses=0000:[01-01])
      PCI L#2 (busid=0000:01:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#1 (Address=3c:ec:ef:7a:0b:fa) "eno1"
    PCIBridge L#2 (busid=0000:00:1c.1 id=8086:a191 class=0604(PCIBridge) link=0.25GB/s buses=0000:[02-02])
      PCI L#3 (busid=0000:02:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#2 (Address=3c:ec:ef:7a:0b:fb) "eno2"
    PCIBridge L#3 (busid=0000:00:1c.5 id=8086:a195 class=0604(PCIBridge) link=0.62GB/s buses=0000:[05-06])
      PCIBridge L#4 (busid=0000:05:00.0 id=1a03:1150 class=0604(PCIBridge) link=0.62GB/s buses=0000:[06-06])
        PCI L#4 (busid=0000:06:00.0 id=1a03:2000 class=0300(VGA))
    PCIBridge L#5 (busid=0000:00:1d.0 id=8086:a198 class=0604(PCIBridge) link=3.94GB/s buses=0000:[07-07])
      PCI L#5 (busid=0000:07:00.0 id=c0a9:540a class=0108(NVMExp) link=3.94GB/s)
        Block(Disk) L#3 (Size=244198584 SectorSize=512 LinuxDeviceID=259:0 Model=CT250P2SSD8 Revision=P2CR012 SerialNumber=2116E597CC4F) "nvme0n1"
  HostBridge L#6 (buses=0000:[17-18])
    PCIBridge L#7 (busid=0000:17:02.0 id=8086:347a class=0604(PCIBridge) link=15.75GB/s buses=0000:[18-18])
      PCI L#6 (busid=0000:18:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=6)
        Network L#4 (Address=ec:0d:9a:92:b2:46 Port=1) "ens6np0"
        OpenFabrics L#5 (NodeGUID=ec0d:9a03:0092:b246 SysImageGUID=ec0d:9a03:0092:b246 Port1State=4 Port1LID=0x0 Port1LMC=0 Port1GID0=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID1=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID2=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID3=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID4=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID5=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID6=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID7=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID8=fe80:0000:0000:0000:88dd:0692:352e:0cec Port1GID9=fe80:0000:0000:0000:88dd:0692:352e:0cec) "rocep24s0"
  HostBridge L#8 (buses=0000:[50-51])
    PCIBridge L#9 (busid=0000:50:04.0 id=8086:347c class=0604(PCIBridge) link=15.75GB/s buses=0000:[51-51])
      PCI L#7 (busid=0000:51:00.0 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#6 (Address=00:00:05:f4:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0a Port=1) "ibs4f0"
        OpenFabrics L#7 (NodeGUID=b8ce:f603:0037:7a0a SysImageGUID=b8ce:f603:0037:7a0a Port1State=4 Port1LID=0xc Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0a) "ibp81s0f0"
      PCI L#8 (busid=0000:51:00.1 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#8 (Address=00:00:03:d3:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0b Port=1) "ibs4f1"
        OpenFabrics L#9 (NodeGUID=b8ce:f603:0037:7a0b SysImageGUID=b8ce:f603:0037:7a0a Port1State=1 Port1LID=0xffff Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0b) "ibp81s0f1"
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        12 L2Cache (type #5)
    depth 4:       12 L1dCache (type #4)
     depth 5:      12 L1iCache (type #9)
      depth 6:     12 Core (type #2)
       depth 7:    12 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  10 Bridge (type #14)
Special depth -5:  9 PCIDev (type #15)
Special depth -6:  10 OSDev (type #16)
Memory attribute #2 name `Bandwidth' flags 5
  NUMANode L#0 = 1790 from cpuset 0x00000fff (Machine L#0)
Memory attribute #3 name `Latency' flags 6
  NUMANode L#0 = 7600 from cpuset 0x00000fff (Machine L#0)
CPU kind #0 efficiency 0 cpuset 0x00000fff
  FrequencyMaxMHz = 3300


> The following branch has pending workqueue changes which make unbound
> workqueues finer-grained by default and a lot more flexible in how they're
> segmented.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
> 
> Can you please test with the branch? If the default doesn't improve the
> situation, you can set WQ_SYSFS on the affected workqueues and change their
> scoping by writing to /sys/devices/virtual/WQ_NAME/affinity_scope. Please
> take a look at
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350
> 
> for more details.

I will give this a try.
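For the archive, the sysfs procedure Tejun describes can be sketched roughly as below. This is a hedged sketch, not verified against the branch: the `/sys/devices/virtual/workqueue/` path and the scope names (`cpu`, `smt`, `cache`, `numa`, `system`) follow the affinity-scopes-v2 documentation he links, and `nfsiod` only appears under sysfs once it is allocated with WQ_SYSFS, which stock kernels do not do.

```shell
# Sketch: inspect and change the affinity scope of an unbound workqueue.
# Assumes the workqueue was created with WQ_SYSFS so it is exported
# under /sys/devices/virtual/workqueue/ (path per the branch docs).
WQ=/sys/devices/virtual/workqueue/nfsiod

if [ -w "$WQ/affinity_scope" ]; then
    cat "$WQ/affinity_scope"            # show the current scope
    echo cache > "$WQ/affinity_scope"   # e.g. group workers per L3 cache
else
    echo "workqueue not exported via WQ_SYSFS; skipping"
fi
```

On a single-socket, single-NUMA-node box like the one above, a scope finer than `numa` (such as `cache` or `cpu`) is what would actually split the single shared pool.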


--
Chuck Lever
