On Thu, 2024-06-20 at 12:29 +1000, Dave Chinner wrote:
> On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> > On Wed, 19 Jun 2024, NeilBrown wrote:
> > > On Wed, 19 Jun 2024, Dave Chinner wrote:
> > > > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > >
> > > > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > > > >
> > > > > > > > On Jun 18, 2024, at 3:29 PM, Trond Myklebust
> > > > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > > > > > >
> > > > > > > > > > On Jun 18, 2024, at 2:32 PM, Trond Myklebust
> > > > > > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > I recently back ported Neil's lwq code and sunrpc server
> > > > > > > > > > changes to our 5.15.130 based kernel in the hope of
> > > > > > > > > > improving the performance for our data servers.
> > > > > > > > > >
> > > > > > > > > > Our performance team recently ran a fio workload on a
> > > > > > > > > > client that was doing 100% NFSv3 reads in O_DIRECT mode
> > > > > > > > > > over an RDMA connection (infiniband) against that
> > > > > > > > > > resulting server. I've attached the resulting flame graph
> > > > > > > > > > from a perf profile run on the server side.
> > > > > > > > > >
> > > > > > > > > > Is anyone else seeing this massive contention for the
> > > > > > > > > > spin lock in __lwq_dequeue? As you can see, it appears to
> > > > > > > > > > be dwarfing all the other nfsd activity on the system in
> > > > > > > > > > question here, being responsible for 45% of all the perf
> > > > > > > > > > hits.
> > > >
> > > > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> > > >
> > > > llist_reverse_order() is an O(n) algorithm involving full length
> > > > linked list traversal. IOWs, it's a worst case cache miss algorithm
> > > > running under a spin lock. And then consider what happens when
> > > > enqueue processing is faster than dequeue processing.
> > >
> > > My expectation was that if enqueue processing (incoming packets) was
> > > faster than dequeue processing (handling NFS requests) then there was
> > > a bottleneck elsewhere, and this one wouldn't be relevant.
> > >
> > > It might be useful to measure how long the queue gets.
> >
> > Thinking about this some more .... if it did turn out that the queue
> > gets long, and maybe even if it didn't, we could reimplement lwq as a
> > simple linked list with head and tail pointers.
> >
> > enqueue would be something like:
> >
> >     new->next = NULL;
> >     old_tail = xchg(&q->tail, new);
> >     if (old_tail)
> >         /* dequeue of old_tail cannot succeed until this assignment completes */
> >         old_tail->next = new
> >     else
> >         q->head = new
> >
> > dequeue would be
> >
> >     spinlock()
> >     ret = q->head;
> >     if (ret) {
> >         while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
> >             /* wait for enqueue of q->tail to complete */
> >             cpu_relax();
> >     }
> >     cmp_xchg(&q->head, ret, ret->next);
> >     spin_unlock();
>
> That might work, but I suspect that it's still only putting off the
> inevitable.
>
> Doing the dequeue purely with atomic operations might be possible,
> but it's not immediately obvious to me how to solve both head/tail
> race conditions with atomic operations. I can work out an algorithm
> that makes enqueue safe against dequeue races (or vice versa), but I
> can't also get the logic on the opposite side to also be safe.
>
> I'll let it bounce around my head a bit more...
>

The latest version of the multigrain timestamp patches doesn't use it,
but Jan Kara pointed out to me that there is a cmpxchg128 function in
the kernel. It's only defined for some arches (x86, s390, and aarch64,
I think), but maybe that could be used here.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
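To make the cmpxchg128 suggestion concrete, here is a minimal,
hypothetical userspace sketch of the double-wide compare-and-swap
mechanic such a primitive exposes. It uses the GCC/Clang __atomic
builtins on a 16-byte {pointer, generation} pair rather than the
in-kernel helpers, and it only shows a tagged-pointer LIFO pop (the
classic use of a 128-bit CAS to defeat ABA); it is not a FIFO
replacement for lwq, nor a solution to the head/tail races Dave
describes above.

/*
 * Hypothetical illustration only -- userspace C, not kernel code.
 * Build (x86-64): gcc -O2 sketch.c -latomic
 */
#include <stdio.h>
#include <stdint.h>

struct node {
	struct node *next;
	int val;
};

/*
 * 16-byte head updated as a single unit: the pointer plus a
 * generation counter, so a compare-exchange against a stale
 * snapshot fails even if the same pointer value reappears.
 */
struct tagged_head {
	struct node *ptr;
	uintptr_t gen;
} __attribute__((aligned(16)));

static struct tagged_head stack_head;

static void push(struct node *n)
{
	struct tagged_head old, new;

	__atomic_load(&stack_head, &old, __ATOMIC_RELAXED);
	do {
		n->next = old.ptr;
		new.ptr = n;
		new.gen = old.gen + 1;
		/* on failure, 'old' is refreshed with the current value */
	} while (!__atomic_compare_exchange(&stack_head, &old, &new, 1,
					    __ATOMIC_RELEASE,
					    __ATOMIC_RELAXED));
}

static struct node *pop(void)
{
	struct tagged_head old, new;

	__atomic_load(&stack_head, &old, __ATOMIC_ACQUIRE);
	do {
		if (!old.ptr)
			return NULL;
		/* safe only because nodes are never freed in this demo */
		new.ptr = old.ptr->next;
		new.gen = old.gen + 1;
	} while (!__atomic_compare_exchange(&stack_head, &old, &new, 1,
					    __ATOMIC_ACQUIRE,
					    __ATOMIC_ACQUIRE));
	return old.ptr;
}

int main(void)
{
	static struct node a = { .val = 1 }, b = { .val = 2 };

	push(&a);
	push(&b);
	printf("%d\n", pop()->val);	/* 2 */
	printf("%d\n", pop()->val);	/* 1 */
	return 0;
}

In the kernel, the analogous update would presumably go through the
cmpxchg128() family Jeff refers to, on the arches that provide it;
the point of the sketch is just that a pointer plus a second 64-bit
word (a counter, or a second pointer) can be swung in one atomic
operation.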