> On Jun 18, 2024, at 7:33 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Tue, 2024-06-18 at 23:26 +0000, Chuck Lever III wrote:
>> 
>>> On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Wed, 19 Jun 2024, Jeff Layton wrote:
>>>> On Tue, 2024-06-18 at 19:54 +0000, Chuck Lever III wrote:
>>>>> 
>>>>>> On Jun 18, 2024, at 3:50 PM, Trond Myklebust
>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
>>>>>>>>> 
>>>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
>>>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> I recently backported Neil's lwq code and sunrpc server
>>>>>>>>>> changes to our 5.15.130-based kernel in the hope of
>>>>>>>>>> improving the performance of our data servers.
>>>>>>>>>> 
>>>>>>>>>> Our performance team recently ran a fio workload on a client
>>>>>>>>>> doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
>>>>>>>>>> connection (InfiniBand) against the resulting server. I've
>>>>>>>>>> attached the flame graph from a perf profile run on the
>>>>>>>>>> server side.
>>>>>>>>>> 
>>>>>>>>>> Is anyone else seeing this massive contention for the spin
>>>>>>>>>> lock in __lwq_dequeue? As you can see, it appears to be
>>>>>>>>>> dwarfing all the other nfsd activity on the system in
>>>>>>>>>> question, being responsible for 45% of all the perf hits.
>>>>>>>>> 
>>>>>>>>> I haven't seen that, but I've been working on other issues.
>>>>>>>>> 
>>>>>>>>> What's the nfsd thread count on your test server? Have you
>>>>>>>>> seen a similar impact on 6.10 kernels?
>>>>>>>> 
>>>>>>>> 640 knfsd threads. The machine was a Supermicro 2029BT-HNR
>>>>>>>> with 2x Intel 6150, 384GB of memory and 6x WDC SN840.
>>>>>>>> 
>>>>>>>> Unfortunately, the machine was a loaner, so I cannot compare
>>>>>>>> it to 6.10. That's why I was asking whether anyone has seen
>>>>>>>> anything similar.
>>>>>>> 
>>>>>>> If this system had more than one NUMA node, then using
>>>>>>> svc's "numa pool" mode might have helped.
>>>>>> 
>>>>>> Interesting. I had forgotten about that setting.
>>>>>> 
>>>>>> Just out of curiosity, is there any reason why we might not want
>>>>>> to default to that mode on a NUMA-enabled system?
>>>>> 
>>>>> Can't think of one off hand. Maybe back in the day it was
>>>>> hard to tell when you were actually /on/ a NUMA system.
>>>>> 
>>>>> Copying Dave to see if he has any recollection.
>>>> 
>>>> It's at least partly because of the klunkiness of the old
>>>> pool_threads interface: you have to bring up the server first
>>>> using the "threads" procfile, and only then can you actually bring
>>>> up threads in the various pools using pool_threads.
>>>> 
>>>> Same for shutdown: you have to bring down the pool_threads first,
>>>> and then you can bring down the final thread and the rest of the
>>>> server with it. Why it was designed this way, I have NFC.
>>>> 
>>>> The new nfsdctl tool and netlink interfaces should make this
>>>> simpler in the future. You'll be able to set the pool-mode in
>>>> /etc/nfs.conf and configure a list of per-pool thread counts in
>>>> there too. Once we have that, I think we'll be in a better
>>>> position to consider doing it by default.
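
Just to make sure I'm following, for a two-node box the current dance
is roughly the sketch below. This is untested and from memory: the
pool mode still comes from the sunrpc module parameter, the thread
counts are made up, and the nfs.conf key names at the end are invented
purely to show the shape of what nfsdctl might accept, not what has
actually been merged.

    # pool mode has to be chosen before nfsd starts
    # ("pernode" is the NUMA mode being discussed here)
    echo pernode > /sys/module/sunrpc/parameters/pool_mode

    # bring the server up first via the "threads" procfile...
    echo 64 > /proc/fs/nfsd/threads

    # ...and only then can threads be spread across the pools,
    # one count per pool (i.e. per NUMA node in pernode mode)
    echo "32 32" > /proc/fs/nfsd/pool_threads

    # shutdown is the reverse: pool_threads back down first, then
    # the remaining thread(s) via the "threads" file
    echo "0 0" > /proc/fs/nfsd/pool_threads
    echo 0 > /proc/fs/nfsd/threads

    # presumably, once nfsdctl lands, that whole dance collapses into
    # a couple of lines in /etc/nfs.conf along these (invented) lines:
    #
    #   [nfsd]
    #   pool-mode = pernode
    #   pool-threads = 32 32
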
>>>> Eventually we'd like to make the thread pools dynamic, at which
>>>> point making that the default becomes much simpler from an
>>>> administrative standpoint.
>>> 
>>> I agree that dynamic thread pools will make NUMA management
>>> simpler. Greg Banks did the NUMA work for SGI - I wonder where he
>>> is now. He was at Fastmail 10 years ago...
>> 
>> Dave (cc'd) designed it with Greg; Greg implemented it.
>> 
>>> The idea was to bind network interfaces to NUMA nodes with
>>> interrupt routing. There was no expectation that work would be
>>> distributed evenly across all nodes. Some might be dedicated to
>>> non-NFS work. So there was expected to be non-trivial configuration
>>> for both IRQ routing and threads-per-node. If we can make
>>> threads-per-node demand-based, then half the problem goes away.
>> 
>> Network devices (and storage devices) are affined to one
>> NUMA node. If the nfsd threads are not on the same node
>> as the network device, there is a significant penalty.
>> 
>> I have a two-node system here, and it performs consistently
>> well when I put it in pool-mode=numa and affine the network
>> device's IRQs to one node.
>> 
>> It even works with two network devices (one per node) --
>> each device gets its own set of nfsd threads.
>> 
>> I don't think the pool_mode needs to be demand-based. If
>> the system is a NUMA system, it makes sense to split up
>> the thread pools and put our pencils down. The only other
>> step that is needed is proper IRQ affinity settings for
>> the network devices.
> 
> Having them be demand-based is a nice-to-have. Right now, you need to
> know how many thread pools you have (it's not always trivial to tell)
> and decide how many threads each gets. There is some cost to getting
> that wrong too.

I misread Neil's suggestion: I thought he meant that the pool_mode
setting itself would be demand-based. I don't have a problem with a
demand-based thread count, and the demand should be estimated per
pool, IMO.

The current pool_mode=numa setting knows how many pools to set up:
it's the same as the number of NUMA nodes on the system. That probably
should not be changed. nfsdctl can look at the pool_mode setting and
know how many pools there are, can't it?

> An on-demand thread pool takes a lot of the guesswork out of the
> equation (assuming we can get the behavior right, of course).
> 
>>> We could even default to one-thread-pool-per-CPU if there are more
>>> than X cpus....
>> 
>> I've never seen a performance improvement in the per-cpu
>> pool mode, fwiw.
>> 
>> --
>> Chuck Lever
> 
> --
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever
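
P.S. For anyone who wants to experiment with the pool-mode=numa
arrangement described above, here is a rough sketch of the setup as I
understand it. The device name and thread count are illustrative, the
sysfs/procfs paths are from memory, and irqbalance will happily rewrite
manual IRQ affinities, so it gets stopped first.

    DEV=ib0     # netdev for the RDMA adapter -- name is illustrative
    NODE=$(cat /sys/class/net/$DEV/device/numa_node)
    CPUS=$(cat /sys/devices/system/node/node$NODE/cpulist)

    # select NUMA-aware pools before nfsd is started
    echo pernode > /sys/module/sunrpc/parameters/pool_mode

    # keep irqbalance from undoing the affinity settings below
    systemctl stop irqbalance

    # steer the device's MSI-X vectors at the CPUs of its own node
    for irq in /sys/class/net/$DEV/device/msi_irqs/*; do
        echo "$CPUS" > /proc/irq/$(basename $irq)/smp_affinity_list
    done

    # then start nfsd; the pool on $NODE handles that device's traffic
    echo 128 > /proc/fs/nfsd/threads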