> On Jun 18, 2024, at 7:33 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Tue, 2024-06-18 at 23:26 +0000, Chuck Lever III wrote:
>> 
>>> On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Wed, 19 Jun 2024, Jeff Layton wrote:
>>>> On Tue, 2024-06-18 at 19:54 +0000, Chuck Lever III wrote:
>>>>> 
>>>>>> On Jun 18, 2024, at 3:50 PM, Trond Myklebust
>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
>>>>>>>>> 
>>>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
>>>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> I recently backported Neil's lwq code and sunrpc server
>>>>>>>>>> changes to our 5.15.130-based kernel in the hope of
>>>>>>>>>> improving the performance of our data servers.
>>>>>>>>>> 
>>>>>>>>>> Our performance team recently ran a fio workload on a client
>>>>>>>>>> doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
>>>>>>>>>> connection (InfiniBand) against the resulting server. I've
>>>>>>>>>> attached the flame graph from a perf profile run on the
>>>>>>>>>> server side.
>>>>>>>>>> 
>>>>>>>>>> Is anyone else seeing this massive contention for the spin
>>>>>>>>>> lock in __lwq_dequeue? As you can see, it appears to be
>>>>>>>>>> dwarfing all the other nfsd activity on the system in
>>>>>>>>>> question, being responsible for 45% of all the perf hits.
>>>>>>>>> 
>>>>>>>>> I haven't seen that, but I've been working on other issues.
>>>>>>>>> 
>>>>>>>>> What's the nfsd thread count on your test server? Have you
>>>>>>>>> seen a similar impact on 6.10 kernels?
>>>>>>>> 
>>>>>>>> 640 knfsd threads. The machine was a Supermicro 2029BT-HNR
>>>>>>>> with 2x Intel 6150, 384GB of memory and 6x WDC SN840.
>>>>>>>> 
>>>>>>>> Unfortunately, the machine was a loaner, so I cannot compare
>>>>>>>> it to 6.10. That's why I was asking whether anyone has seen
>>>>>>>> anything similar.
>>>>>>> 
>>>>>>> If this system had more than one NUMA node, then using
>>>>>>> svc's "numa pool" mode might have helped.
>>>>>> 
>>>>>> Interesting. I had forgotten about that setting.
>>>>>> 
>>>>>> Just out of curiosity, is there any reason why we might not want
>>>>>> to default to that mode on a NUMA-enabled system?
>>>>> 
>>>>> Can't think of one off hand. Maybe back in the day it was
>>>>> hard to tell when you were actually /on/ a NUMA system.
>>>>> 
>>>>> Copying Dave to see if he has any recollection.
>>>> 
>>>> It's at least partly because of the klunkiness of the old
>>>> pool_threads interface: you have to bring up the server first
>>>> using the "threads" procfile, and only then can you actually bring
>>>> up threads in the various pools using pool_threads.
>>>> 
>>>> Same for shutdown: you have to bring down the pool_threads first,
>>>> and then you can bring down the final thread and the rest of the
>>>> server with it. Why it was designed this way, I have NFC.
>>>> 
>>>> The new nfsdctl tool and netlink interfaces should make this
>>>> simpler in the future. You'll be able to set the pool-mode in
>>>> /etc/nfs.conf and configure a list of per-pool thread counts in
>>>> there too. Once we have that, I think we'll be in a better
>>>> position to consider doing it by default.
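
Just to make sure I'm following, for a two-node box the current dance
is roughly the sketch below. This is untested and from memory: the
pool mode still comes from the sunrpc module parameter, the thread
counts are made up, and the nfs.conf key names at the end are invented
purely to show the shape of what nfsdctl might accept, not what has
actually been merged.

    # pool mode has to be chosen before nfsd starts
    # ("pernode" is the NUMA mode being discussed here)
    echo pernode > /sys/module/sunrpc/parameters/pool_mode

    # bring the server up first via the "threads" procfile...
    echo 64 > /proc/fs/nfsd/threads

    # ...and only then can threads be spread across the pools,
    # one count per pool (i.e. per NUMA node in pernode mode)
    echo "32 32" > /proc/fs/nfsd/pool_threads

    # shutdown is the reverse: pool_threads back down first, then
    # the remaining thread(s) via the "threads" file
    echo "0 0" > /proc/fs/nfsd/pool_threads
    echo 0 > /proc/fs/nfsd/threads

    # presumably, once nfsdctl lands, that whole dance collapses into
    # a couple of lines in /etc/nfs.conf along these (invented) lines:
    #
    #   [nfsd]
    #   pool-mode = pernode
    #   pool-threads = 32 32
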
>>>> Eventually we'd like to make the thread pools dynamic, at which
>>>> point making that the default becomes much simpler from an
>>>> administrative standpoint.
>>> 
>>> I agree that dynamic thread pools will make NUMA management
>>> simpler. Greg Banks did the NUMA work for SGI - I wonder where he
>>> is now. He was at Fastmail 10 years ago...
>> 
>> Dave (cc'd) designed it with Greg; Greg implemented it.
>> 
>>> The idea was to bind network interfaces to NUMA nodes with
>>> interrupt routing. There was no expectation that work would be
>>> distributed evenly across all nodes. Some might be dedicated to
>>> non-NFS work. So there was expected to be non-trivial configuration
>>> for both IRQ routing and threads-per-node. If we can make
>>> threads-per-node demand-based, then half the problem goes away.
>> 
>> Network devices (and storage devices) are affined to one
>> NUMA node. If the nfsd threads are not on the same node
>> as the network device, there is a significant penalty.
>> 
>> I have a two-node system here, and it performs consistently
>> well when I put it in pool-mode=numa and affine the network
>> device's IRQs to one node.
>> 
>> It even works with two network devices (one per node) --
>> each device gets its own set of nfsd threads.
>> 
>> I don't think the pool_mode needs to be demand-based. If
>> the system is a NUMA system, it makes sense to split up
>> the thread pools and put our pencils down. The only other
>> step that is needed is proper IRQ affinity settings for
>> the network devices.
> 
> Having them be demand-based is a nice-to-have. Right now, you need to
> know how many thread pools you have (it's not always trivial to tell)
> and decide how many threads each gets. There is some cost to getting
> that wrong too.

I misread Neil's suggestion: I thought he meant that the pool_mode
setting itself would be demand-based. I don't have a problem with a
demand-based thread count, and the demand should be estimated per
pool, IMO.

The current pool_mode=numa setting knows how many pools to set up:
it's the same as the number of NUMA nodes on the system. That probably
should not be changed. nfsdctl can look at the pool_mode setting and
know how many pools there are, can't it?

> An on-demand thread pool takes a lot of the guesswork out of the
> equation (assuming we can get the behavior right, of course).
> 
>>> We could even default to one-thread-pool-per-CPU if there are more
>>> than X cpus....
>> 
>> I've never seen a performance improvement in the per-cpu
>> pool mode, fwiw.
>> 
>> --
>> Chuck Lever
> 
> --
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever
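
P.S. For anyone who wants to experiment with the pool-mode=numa
arrangement described above, here is a rough sketch of the setup as I
understand it. The device name and thread count are illustrative, the
sysfs/procfs paths are from memory, and irqbalance will happily rewrite
manual IRQ affinities, so it gets stopped first.

    DEV=ib0     # netdev for the RDMA adapter -- name is illustrative
    NODE=$(cat /sys/class/net/$DEV/device/numa_node)
    CPUS=$(cat /sys/devices/system/node/node$NODE/cpulist)

    # select NUMA-aware pools before nfsd is started
    echo pernode > /sys/module/sunrpc/parameters/pool_mode

    # keep irqbalance from undoing the affinity settings below
    systemctl stop irqbalance

    # steer the device's MSI-X vectors at the CPUs of its own node
    for irq in /sys/class/net/$DEV/device/msi_irqs/*; do
        echo "$CPUS" > /proc/irq/$(basename $irq)/smp_affinity_list
    done

    # then start nfsd; the pool on $NODE handles that device's traffic
    echo 128 > /proc/fs/nfsd/threads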