Re: knfsd performance

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Tue, 18 Jun 2024 23:26:22 +0000

> On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Wed, 19 Jun 2024, Jeff Layton wrote:
>> On Tue, 2024-06-18 at 19:54 +0000, Chuck Lever III wrote:
>>> 
>>> 
>>>> On Jun 18, 2024, at 3:50 PM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
>>>>> 
>>>>> 
>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> I recently back ported Neil's lwq code and sunrpc server changes
>>>>>>>> to our 5.15.130 based kernel in the hope of improving the
>>>>>>>> performance for our data servers.
>>>>>>>> 
>>>>>>>> Our performance team recently ran a fio workload on a client that
>>>>>>>> was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
>>>>>>>> connection (infiniband) against that resulting server. I've
>>>>>>>> attached the resulting flame graph from a perf profile run on the
>>>>>>>> server side.
>>>>>>>> 
>>>>>>>> Is anyone else seeing this massive contention for the spin lock in
>>>>>>>> __lwq_dequeue? As you can see, it appears to be dwarfing all the
>>>>>>>> other nfsd activity on the system in question here, being
>>>>>>>> responsible for 45% of all the perf hits.
>>>>>>> 
>>>>>>> I haven't seen that, but I've been working on other issues.
>>>>>>> 
>>>>>>> What's the nfsd thread count on your test server? Have you
>>>>>>> seen a similar impact on 6.10 kernels?
>>>>>>> 
>>>>>> 
>>>>>> 640 knfsd threads. The machine was a Supermicro 2029BT-HNR with
>>>>>> 2xIntel 6150, 384GB of memory and 6xWDC SN840.
>>>>>> 
>>>>>> Unfortunately, the machine was a loaner, so cannot compare to 6.10.
>>>>>> That's why I was asking if anyone has seen anything similar.
>>>>> 
>>>>> If this system had more than one NUMA node, then using
>>>>> svc's "numa pool" mode might have helped.
>>>>> 
>>>> 
>>>> Interesting. I had forgotten about that setting.
>>>> 
>>>> Just out of curiosity, is there any reason why we might not want to
>>>> default to that mode on a NUMA enabled system?
>>> 
>>> Can't think of one off hand. Maybe back in the day it was
>>> hard to tell when you were actually /on/ a NUMA system.
>>> 
>>> Copying Dave to see if he has any recollection.
>>> 
>> 
>> It's at least partly because of the klunkiness of the old pool_threads
>> interface: You have to bring up the server first using the "threads"
>> procfile, and then you can actually bring up threads in the various
>> pools using pool_threads.
>> 
>> Same for shutdown. You have to bring down the pool_threads first and
>> then you can bring down the final thread and the rest of the server
>> with it. Why it was designed this way, I have NFC.
>> 
>> The new nfsdctl tool and netlink interfaces should make this simpler in
>> the future. You'll be able to set the pool-mode in /etc/nfs.conf and
>> configure a list of per-pool thread counts in there too. Once we have
>> that, I think we'll be in a better position to consider doing it by
>> default.
>> 
>> Eventually we'd like to make the thread pools dynamic, at which point
>> making that the default becomes much simpler from an administrative
>> standpoint.
> 
> I agree that dynamic thread pools will make numa management simpler.
> Greg Banks did the numa work for SGI - I wonder where he is now.  He was
> at Fastmail 10 years ago.

Dave (cc'd) designed it with Greg; Greg implemented it.

> The idea was to bind network interfaces to numa nodes with interrupt
> routing.  There was no expectation that work would be distributed evenly
> across all nodes. Some might be dedicated to non-nfs work.  So there was
> expected to be non-trivial configuration for both IRQ routing and
> threads-per-node.  If we can make threads-per-node demand-based, then
> half the problem goes away.

Network devices (and storage devices) are affined to one
NUMA node. If the nfsd threads are not on the same node
as the network device, there is a significant penalty.

I have a two-node system here, and it performs consistently
well when I put it in pool-mode=numa and affine the network
device's IRQs to one node.

It even works with two network devices (one per node) --
each device gets its own set of nfsd threads.
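
For anyone who wants to try the same setup, here is a rough sketch of
the bring-up order Jeff describes above (server first via "threads",
then per-pool counts via "pool_threads"). It only builds the list of
writes; it does not touch the system. Assumptions, to be checked
against your kernel: the file paths below, "pernode" as the spelling
of the NUMA pool mode, the need to set the mode while nfsd is stopped,
and starting with a single thread. plan_bringup is a made-up helper
name.

```python
# Paths as I remember them on current kernels; treat them as assumptions.
POOL_MODE = "/sys/module/sunrpc/parameters/pool_mode"
THREADS = "/proc/fs/nfsd/threads"
POOL_THREADS = "/proc/fs/nfsd/pool_threads"


def plan_bringup(per_pool_threads, pool_mode="pernode"):
    """Return the ordered (path, value) writes for a pooled bring-up.

    per_pool_threads holds one thread count per pool, e.g. one per NUMA
    node. Nothing is written here; the caller applies the plan as root.
    """
    if not per_pool_threads:
        raise ValueError("need at least one pool")
    if any(n < 0 for n in per_pool_threads):
        raise ValueError("thread counts must be non-negative")
    return [
        # Assumption: the mode has to be chosen before nfsd is running.
        (POOL_MODE, pool_mode),
        # Bring the server up first (a single thread is my assumption).
        (THREADS, "1"),
        # Then set the per-pool counts as a space-separated list.
        (POOL_THREADS, " ".join(str(n) for n in per_pool_threads)),
    ]


if __name__ == "__main__":
    # Two pools of 16 threads each, e.g. for a two-node system.
    for path, value in plan_bringup([16, 16]):
        print(f"echo {value!r} > {path}")
```

Keeping the plan separate from the writes makes it easy to review the
order before running anything as root.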

I don't think the pool_mode needs to be demand based. If
the system is a NUMA system, it makes sense to split up
the thread pools and put our pencils down. The only other
step that is needed is proper IRQ affinity settings for
the network devices.
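
The IRQ affinity step can be sketched the same way. This is a sketch
under assumptions, not a recipe: it takes a node's CPU list in the
format found in /sys/devices/system/node/node<N>/cpulist and plans a
write to /proc/irq/<irq>/smp_affinity_list for each IRQ. The IRQ
numbers in the example are made up (look up the real ones for your
device in /proc/interrupts), the helper names are hypothetical, and a
running irqbalance may undo the settings.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like "5" has no range end, so reuse the start.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def plan_irq_affinity(irqs, node_cpulist):
    """Return (path, value) writes that pin each IRQ to one node's CPUs."""
    cpulist = node_cpulist.strip()
    if not parse_cpulist(cpulist):
        raise ValueError("node has no CPUs listed")
    return [(f"/proc/irq/{irq}/smp_affinity_list", cpulist) for irq in irqs]


if __name__ == "__main__":
    # Hypothetical values: node 0 owns CPUs 0-17 and 36-53, and the
    # network device uses IRQs 120 and 121.
    for path, value in plan_irq_affinity([120, 121], "0-17,36-53\n"):
        print(f"echo {value} > {path}")
```

The device's home node can usually be read from
/sys/class/net/<dev>/device/numa_node, which tells you which node's
cpulist to feed in.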

> We could even default to one-thread-pool-per-CPU if there are more than
> X cpus....

I've never seen a performance improvement in the per-cpu
pool mode, fwiw.

--
Chuck Lever