On Thu, 20 Jun 2024, Jeff Layton wrote:
> On Thu, 2024-06-20 at 12:29 +1000, Dave Chinner wrote:
> > On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
> > > On Wed, 19 Jun 2024, NeilBrown wrote:
> > > > On Wed, 19 Jun 2024, Dave Chinner wrote:
> > > > > On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
> > > > > > On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
> > > > > > > > On Jun 18, 2024, at 3:29 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
> > > > > > > > > > On Jun 18, 2024, at 2:32 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > I recently back ported Neil's lwq code and sunrpc server
> > > > > > > > > > > changes to our 5.15.130 based kernel in the hope of
> > > > > > > > > > > improving the performance for our data servers.
> > > > > > > > > > >
> > > > > > > > > > > Our performance team recently ran a fio workload on a
> > > > > > > > > > > client that was doing 100% NFSv3 reads in O_DIRECT mode
> > > > > > > > > > > over an RDMA connection (infiniband) against that
> > > > > > > > > > > resulting server. I've attached the resulting flame graph
> > > > > > > > > > > from a perf profile run on the server side.
> > > > > > > > > > >
> > > > > > > > > > > Is anyone else seeing this massive contention for the
> > > > > > > > > > > spin lock in __lwq_dequeue? As you can see, it appears
> > > > > > > > > > > to be dwarfing all the other nfsd activity on the system
> > > > > > > > > > > in question here, being responsible for 45% of all the
> > > > > > > > > > > perf hits.
> > > > >
> > > > > Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
> > > > >
> > > > > llist_reverse_order() is an O(n) algorithm involving full length
> > > > > linked list traversal. IOWs, it's a worst case cache miss algorithm
> > > > > running under a spin lock. And then consider what happens when
> > > > > enqueue processing is faster than dequeue processing.
> > > >
> > > > My expectation was that if enqueue processing (incoming packets) was
> > > > faster than dequeue processing (handling NFS requests) then there was
> > > > a bottleneck elsewhere, and this one wouldn't be relevant.
> > > >
> > > > It might be useful to measure how long the queue gets.
> > >
> > > Thinking about this some more .... if it did turn out that the queue
> > > gets long, and maybe even if it didn't, we could reimplement lwq as a
> > > simple linked list with head and tail pointers.
> > >
> > > enqueue would be something like:
> > >
> > >    new->next = NULL;
> > >    old_tail = xchg(&q->tail, new);
> > >    if (old_tail)
> > >         /* dequeue of old_tail cannot succeed until this assignment completes */
> > >         old_tail->next = new
> > >    else
> > >         q->head = new
> > >
> > > dequeue would be
> > >
> > >    spinlock()
> > >    ret = q->head;
> > >    if (ret) {
> > >         while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
> > >              /* wait for enqueue of q->tail to complete */
> > >              cpu_relax();
> > >    }
> > >    cmp_xchg(&q->head, ret, ret->next);
> > >    spin_unlock();
> >
> > That might work, but I suspect that it's still only putting off the
> > inevitable.
> >
> > Doing the dequeue purely with atomic operations might be possible,
> > but it's not immediately obvious to me how to solve both head/tail
> > race conditions with atomic operations. I can work out an algorithm
> > that makes enqueue safe against dequeue races (or vice versa), but I
> > can't also get the logic on the opposite side to also be safe.
> >
> > I'll let it bounce around my head a bit more...
> >
>
> The latest version of the multigrain timestamp patches doesn't use it,
> but Jan Kara pointed out to me that there is a cmpxchg128 function in
> the kernel. It's only defined for some arches (x86, s390, and aarch64,
> I think), but maybe that could be used here.

One of the particularly pernicious scenarios that a non-locking approach
needs to deal with is two threads trying to dequeue at the same time
where, just before the atomic op which commits the dequeue, one of the
threads is suspended - e.g. by an interrupt, or the VM pauses, or
whatever. While that thread is paused the other thread completes the
dequeue, consumes the event, and recycles the object, and it ends up
back at the start of the queue, though this time with a different
->next pointer. The paused thread then wakes up and its cmpxchg
succeeds, but does the wrong thing.

This can be addressed with ll/sc because the sc will always fail after
a long pause like that. With cmpxchg we could fix it by including a
large sequence counter alongside the address in the atomic exchange.
That is where cmpxchg128 comes in. So:

 retry:
    head = q->head
    ctr = q->ctr
    new = head->next
    if (cmpxchg128(q, [head,ctr], [new,ctr+1]) fails)
         goto retry;

would avoid that problem, as the ctr would be different - and a 64-bit
counter never cycles.

This only solves that particular issue. There are other more obvious
races, but I think it could be made to work. We'd need a spinlock
fallback for other arches, of course.

Thanks,
NeilBrown
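
For concreteness, below is a minimal, self-contained userspace sketch of
the pointer-plus-counter dequeue described above. It is only an
illustration of the ABA-avoidance idea: the lfq_* names and layout are
invented for the example, GCC/Clang's 16-byte __atomic_compare_exchange
builtin stands in for the kernel's cmpxchg128(), and it assumes nodes
remain readable (type-stable memory) while a racing dequeue may still
touch them. It is not the actual lwq code.

    /*
     * Sketch only: "head pointer + generation counter" dequeue.
     * Compile on x86-64 with:  gcc -O2 -mcx16 lfq.c -latomic
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct lfq_node {
        struct lfq_node *next;
    };

    /* The 16-byte unit that is swapped atomically: head pointer + counter. */
    struct lfq_head {
        struct lfq_node *head;
        uint64_t ctr;               /* bumped on every successful dequeue */
    } __attribute__((aligned(16)));

    static struct lfq_node *lfq_dequeue(struct lfq_head *q)
    {
        struct lfq_head old, new;

        __atomic_load(q, &old, __ATOMIC_ACQUIRE);
        do {
            if (!old.head)
                return NULL;
            /*
             * Even if old.head was dequeued, recycled and re-queued while
             * this thread was paused (the scenario described above), the
             * counter will have moved on, so the compare-exchange below
             * fails instead of installing a stale ->next.
             */
            new.head = old.head->next;
            new.ctr  = old.ctr + 1;
        } while (!__atomic_compare_exchange(q, &old, &new, true,
                                            __ATOMIC_ACQ_REL,
                                            __ATOMIC_ACQUIRE));
        return old.head;
    }

    int main(void)
    {
        struct lfq_node n2 = { .next = NULL };
        struct lfq_node n1 = { .next = &n2 };
        struct lfq_head q = { .head = &n1, .ctr = 0 };

        printf("popped %p then %p\n",
               (void *)lfq_dequeue(&q), (void *)lfq_dequeue(&q));
        return 0;
    }

On an architecture without a usable 16-byte compare-and-exchange this
would have to fall back to the spinlock path, as noted above.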