On Wed, 19 Jun 2024, Dave Chinner wrote:
> On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> >
> > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > >>
> > >>
> > >>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > >>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > >>>>
> > >>>>
> > >>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > >>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >>>>>
> > >>>>> I recently backported Neil's lwq code and sunrpc server
> > >>>>> changes to our 5.15.130 based kernel in the hope of improving
> > >>>>> the performance for our data servers.
> > >>>>>
> > >>>>> Our performance team recently ran a fio workload on a client
> > >>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
> > >>>>> connection (infiniband) against the resulting server. I've
> > >>>>> attached the resulting flame graph from a perf profile run on
> > >>>>> the server side.
> > >>>>>
> > >>>>> Is anyone else seeing this massive contention for the spin lock
> > >>>>> in __lwq_dequeue? As you can see, it appears to be dwarfing all
> > >>>>> the other nfsd activity on the system in question here, being
> > >>>>> responsible for 45% of all the perf hits.
>
> Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
>
> llist_reverse_order() is an O(n) algorithm involving a full-length
> linked list traversal. IOWs, it's a worst case cache miss algorithm
> running under a spin lock. And then consider what happens when
> enqueue processing is faster than dequeue processing.

My expectation was that if enqueue processing (incoming packets) was
faster than dequeue processing (handling NFS requests), then the
bottleneck would be elsewhere and this one wouldn't be relevant.
It might be useful to measure how long the queue gets.

>
> This means the depth of the queue grows, and the ultimate length of
> the queue is unbounded. Because of the batch processing nature of lwq -
> it takes ->new, reverses it and places it in ->ready - the length of
> the list that needs reversing ends up growing with every batch whenever
> we queue faster than we dequeue. Unbounded processing queues are bad
> even when they have O(1) behaviour. lwq has O(n) worst case
> behaviour, and that makes this even worse...
>
> Regardless, the current lwq could be slightly improved - the
> lockless enqueue competes for the same cacheline as the dequeue
> serialisation lock.
>
> struct lwq {
> 	spinlock_t		lock;
> 	struct llist_node	*ready;		/* entries to be dequeued */
> 	struct llist_head	new;		/* entries being enqueued */
> };
>
> Adding __cacheline_aligned_in_smp to ->new (the enqueue side) might
> help reduce this enqueue/dequeue cacheline contention a bit by
> separating them onto different cachelines. That will push the point
> of catastrophic breakdown out a little bit, but not solve the issue
> of queue-depth-based batch processing on the dequeue side.
>

Yes, that might be beneficial - thanks.

> I suspect a lockless ring buffer might be a more scalable solution
> for the nfsd...

We would need to size the buffer to the maximum number of connections
(if there is one - I don't recall) or resize the buffer as the number
of connections grows.  Certainly doable.

Thanks,
NeilBrown
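
For anyone reading along who hasn't looked at lwq: below is a minimal
sketch, in the spirit of lib/lwq.c, of the dequeue pattern Dave is
describing. It is deliberately simplified (the real code's sentinel
handling and the surrounding svc wakeup logic are omitted, and the
function name is made up), so treat it as an illustration rather than
the upstream implementation. The point is that llist_reverse_order()
walks the whole accumulated batch while ->lock is held:

#include <linux/llist.h>
#include <linux/spinlock.h>

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */
	struct llist_head	new;		/* entries being enqueued */
};

/*
 * Simplified dequeue sketch: when ->ready is empty, take everything
 * that was enqueued locklessly onto ->new and reverse it into FIFO
 * order.  The reversal is O(n) in the batch size and runs with ->lock
 * held, which is the contention Trond's profile is showing.
 */
static struct llist_node *lwq_dequeue_sketch(struct lwq *q)
{
	struct llist_node *node;

	spin_lock(&q->lock);
	if (!q->ready && !llist_empty(&q->new))
		q->ready = llist_reverse_order(llist_del_all(&q->new));
	node = q->ready;
	if (node)
		q->ready = node->next;
	spin_unlock(&q->lock);
	return node;
}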
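
On "measure how long the queue gets": a hypothetical debug helper along
these lines (not part of the lwq API; the name is made up) could report
the backlog on the enqueue side, i.e. the size of the next batch that
will have to be reversed under ->lock. Entries added after ->new.first
is read are missed, so the result is only an estimate:

#include <linux/llist.h>
#include <linux/lwq.h>
#include <linux/spinlock.h>

/*
 * Hypothetical debug helper: estimate how many entries are waiting on
 * the lockless enqueue side.  Holding ->lock keeps __lwq_dequeue() from
 * taking and reversing the list underneath us; concurrent enqueues only
 * push new heads, so the chain we snapshot stays stable while we walk.
 * ->ready is deliberately not counted, to keep the sketch independent
 * of lwq's internal sentinel handling.
 */
static int lwq_backlog_estimate(struct lwq *q)
{
	struct llist_node *n;
	int len = 0;

	spin_lock(&q->lock);
	for (n = q->new.first; n; n = n->next)
		len++;
	spin_unlock(&q->lock);
	return len;
}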
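
Dave's cacheline suggestion would look something like this; it is just
the obvious application of his comment, not a tested patch. For a
struct member the annotation is spelled ____cacheline_aligned_in_smp,
and it pushes the lockless enqueue head onto its own cacheline so
producers stop bouncing the line that holds ->lock and ->ready:

struct lwq {
	spinlock_t		lock;
	struct llist_node	*ready;		/* entries to be dequeued */

	/* enqueue side on its own cacheline, away from ->lock/->ready */
	struct llist_head	new ____cacheline_aligned_in_smp;
						/* entries being enqueued */
};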
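
On the ring buffer idea and the sizing question: the kernel's existing
ptr_ring (include/linux/ptr_ring.h) is one data point for how sizing
and resizing could be handled, though it is not lockless - it uses
separate producer and consumer spinlocks - so this is only a sketch of
the sizing/resizing concern, not the structure Dave has in mind. The
xprt_ring wrapper and its functions are made-up names for illustration:

#include <linux/gfp.h>
#include <linux/ptr_ring.h>

/* Hypothetical wrapper: a request queue sized to a connection count. */
struct xprt_ring {
	struct ptr_ring ring;
};

static int xprt_ring_init(struct xprt_ring *xr, int max_connections)
{
	/* one slot per connection that can have a request pending */
	return ptr_ring_init(&xr->ring, max_connections, GFP_KERNEL);
}

/* Grow the ring if the number of connections grows. */
static int xprt_ring_grow(struct xprt_ring *xr, int new_size)
{
	return ptr_ring_resize(&xr->ring, new_size, GFP_KERNEL, NULL);
}

static int xprt_ring_enqueue(struct xprt_ring *xr, void *req)
{
	return ptr_ring_produce(&xr->ring, req);	/* -ENOSPC if full */
}

static void *xprt_ring_dequeue(struct xprt_ring *xr)
{
	return ptr_ring_consume(&xr->ring);		/* NULL when empty */
}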