> On Jun 19, 2024, at 10:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote: >> On Wed, 19 Jun 2024, NeilBrown wrote: >>> On Wed, 19 Jun 2024, Dave Chinner wrote: >>>> On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: >>>>>> >>>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote: >>>>>>> >>>>>>> >>>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust >>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote: >>>>>>>> >>>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust >>>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote: >>>>>>>>>> >>>>>>>>>> I recently back ported Neil's lwq code and sunrpc server >>>>>>>>>> changes to >>>>>>>>>> our >>>>>>>>>> 5.15.130 based kernel in the hope of improving the performance >>>>>>>>>> for >>>>>>>>>> our >>>>>>>>>> data servers. >>>>>>>>>> >>>>>>>>>> Our performance team recently ran a fio workload on a client >>>>>>>>>> that >>>>>>>>>> was >>>>>>>>>> doing 100% NFSv3 reads in O_DIRECT mode over an RDMA connection >>>>>>>>>> (infiniband) against that resulting server. I've attached the >>>>>>>>>> resulting >>>>>>>>>> flame graph from a perf profile run on the server side. >>>>>>>>>> >>>>>>>>>> Is anyone else seeing this massive contention for the spin lock >>>>>>>>>> in >>>>>>>>>> __lwq_dequeue? As you can see, it appears to be dwarfing all >>>>>>>>>> the >>>>>>>>>> other >>>>>>>>>> nfsd activity on the system in question here, being responsible >>>>>>>>>> for >>>>>>>>>> 45% >>>>>>>>>> of all the perf hits. >>>> >>>> Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock. >>>> >>>> llist_reverse_order() is an O(n) algorithm involving full length >>>> linked list traversal. IOWs, it's a worst case cache miss algorithm >>>> running under a spin lock. And then consider what happens when >>>> enqueue processing is faster than dequeue processing. >>> >>> My expectation was that if enqueue processing (incoming packets) was >>> faster than dequeue processing (handling NFS requests) then there was a >>> bottleneck elsewhere, and this one wouldn't be relevant. >>> >>> It might be useful to measure how long the queue gets. >> >> Thinking about this some more .... if it did turn out that the queue >> gets long, and maybe even if it didn't, we could reimplement lwq as a >> simple linked list with head and tail pointers. >> >> enqueue would be something like: >> >> new->next = NULL; >> old_tail = xchg(&q->tail, new); >> if (old_tail) >> /* dequeue of old_tail cannot succeed until this assignment completes */ >> old_tail->next = new >> else >> q->head = new >> >> dequeue would be >> >> spinlock() >> ret = q->head; >> if (ret) { >> while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret) >> /* wait for enqueue of q->tail to complete */ >> cpu_relax(); >> } >> cmp_xchg(&q->head, ret, ret->next); >> spin_unlock(); > > That might work, but I suspect that it's still only putting off the > inevitable. > > Doing the dequeue purely with atomic operations might be possible, > but it's not immediately obvious to me how to solve both head/tail > race conditions with atomic operations. I can work out an algorithm > that makes enqueue safe against dequeue races (or vice versa), but I > can't also get the logic on the opposite side to also be safe. > > I'll let it bounce around my head a bit more... I agree that O(n) dequeuing is potentially alarming. Before we go too far down this path, I'd like to see reproducible numbers that show there is a problem when a recent upstream NFS server is properly set up with a sensible number of threads and running a real workload. Otherwise there is a risk of introducing code in a fundamental part of SunRPC that is optimized to the point of brittleness, and for no good reason. This is what keeps me from sleeping at night: [1] See, it even has my name on it. :-) -- Chuck Lever [1] - https://www.linusakesson.net/programming/kernighans-lever/index.php