Re: knfsd performance

> On Jun 19, 2024, at 10:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, Jun 20, 2024 at 07:25:15AM +1000, NeilBrown wrote:
>> On Wed, 19 Jun 2024, NeilBrown wrote:
>>> On Wed, 19 Jun 2024, Dave Chinner wrote:
>>>>> On Tue, Jun 18, 2024 at 07:54:43PM +0000, Chuck Lever III wrote:
>>>>>> On Jun 18, 2024, at 3:50 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Tue, 2024-06-18 at 19:39 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 18, 2024, at 3:29 PM, Trond Myklebust
>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On Tue, 2024-06-18 at 18:40 +0000, Chuck Lever III wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jun 18, 2024, at 2:32 PM, Trond Myklebust
>>>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> I recently back ported Neil's lwq code and sunrpc server
>>>>>>>>>> changes to our 5.15.130 based kernel in the hope of improving
>>>>>>>>>> the performance for our data servers.
>>>>>>>>>> 
>>>>>>>>>> Our performance team recently ran a fio workload on a client
>>>>>>>>>> that was doing 100% NFSv3 reads in O_DIRECT mode over an RDMA
>>>>>>>>>> connection (infiniband) against that resulting server. I've
>>>>>>>>>> attached the resulting flame graph from a perf profile run on
>>>>>>>>>> the server side.
>>>>>>>>>> 
>>>>>>>>>> Is anyone else seeing this massive contention for the spin lock
>>>>>>>>>> in __lwq_dequeue? As you can see, it appears to be dwarfing all
>>>>>>>>>> the other nfsd activity on the system in question here, being
>>>>>>>>>> responsible for 45% of all the perf hits.
>>>> 
>>>> Ouch. __lwq_dequeue() runs llist_reverse_order() under a spinlock.
>>>> 
>>>> llist_reverse_order() is an O(n) algorithm involving full length
>>>> linked list traversal. IOWs, it's a worst case cache miss algorithm
>>>> running under a spin lock. And then consider what happens when
>>>> enqueue processing is faster than dequeue processing.
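
(To make the shape of that concrete: the contended path boils down to
roughly the sketch below. This is a simplification for illustration
only, not the exact lib/lwq.c code, and the struct/field names here
are made up.)

  struct lwq_like {
          spinlock_t         lock;
          struct llist_node  *ready;  /* oldest-first, consumed under lock */
          struct llist_head  new;     /* lock-free enqueue, newest-first */
  };

  static struct llist_node *dequeue_sketch(struct lwq_like *q)
  {
          struct llist_node *node;

          spin_lock(&q->lock);
          if (!q->ready)
                  /* O(n) walk of everything enqueued since the last
                   * refill, done while other dequeuers spin on q->lock */
                  q->ready = llist_reverse_order(llist_del_all(&q->new));
          node = q->ready;
          if (node)
                  q->ready = node->next;
          spin_unlock(&q->lock);
          return node;
  }
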
>>> 
>>> My expectation was that if enqueue processing (incoming packets) was
>>> faster than dequeue processing (handling NFS requests) then there was a
>>> bottleneck elsewhere, and this one wouldn't be relevant.
>>> 
>>> It might be useful to measure how long the queue gets.
>> 
>> Thinking about this some more ....  if it did turn out that the queue
>> gets long, and maybe even if it didn't, we could reimplement lwq as a
>> simple linked list with head and tail pointers.
>> 
>> enqueue would be something like:
>> 
>>  new->next = NULL;
>>  old_tail = xchg(&q->tail, new);
>>  if (old_tail)
>>       /* dequeue of old_tail cannot succeed until this assignment completes */
>>       old_tail->next = new;
>>  else
>>       q->head = new;
>> 
>> dequeue would be
>> 
>>  spin_lock();
>>  ret = q->head;
>>  if (ret) {
>>        while (ret->next == NULL && cmp_xchg(&q->tail, ret, NULL) != ret)
>>            /* wait for enqueue of q->tail to complete */
>>            cpu_relax();
>>        cmp_xchg(&q->head, ret, ret->next);
>>  }
>>  spin_unlock();
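
(Fleshing that sketch out into something closer to compilable C, with
made-up type and function names and using the kernel's xchg()/cmpxchg()
primitives -- untested, just to make the assumptions explicit:)

  struct twoq_node {
          struct twoq_node *next;
  };

  struct twoq {
          struct twoq_node *head;  /* consumer side, under q->lock */
          struct twoq_node *tail;  /* producer side, updated with xchg() */
          spinlock_t       lock;
  };

  /* multi-producer enqueue: lock-free, one xchg() per call */
  static void twoq_enqueue(struct twoq *q, struct twoq_node *new)
  {
          struct twoq_node *old_tail;

          new->next = NULL;
          old_tail = xchg(&q->tail, new);
          if (old_tail)
                  /* dequeue of old_tail cannot complete until this store */
                  WRITE_ONCE(old_tail->next, new);
          else
                  /* queue was empty (or drained): publish the new head */
                  WRITE_ONCE(q->head, new);
  }

  /* dequeue: serialized by q->lock */
  static struct twoq_node *twoq_dequeue(struct twoq *q)
  {
          struct twoq_node *ret;

          spin_lock(&q->lock);
          ret = READ_ONCE(q->head);
          if (ret) {
                  /* if ret looks like the last node, race any concurrent
                   * enqueue for ownership of q->tail */
                  while (READ_ONCE(ret->next) == NULL &&
                         cmpxchg(&q->tail, ret, NULL) != ret)
                          /* an enqueuer won; wait for it to set ret->next */
                          cpu_relax();
                  /* cmpxchg, not a plain store: a racing enqueue may have
                   * already installed a new q->head after seeing tail == NULL */
                  cmpxchg(&q->head, ret, READ_ONCE(ret->next));
          }
          spin_unlock(&q->lock);
          return ret;
  }
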
> 
> That might work, but I suspect that it's still only putting off the
> inevitable.
> 
> Doing the dequeue purely with atomic operations might be possible,
> but it's not immediately obvious to me how to solve both head/tail
> race conditions with atomic operations. I can work out an algorithm
> that makes enqueue safe against dequeue races (or vice versa), but I
> can't also get the logic on the opposite side to be safe.
> 
> I'll let it bounce around my head a bit more...

I agree that O(n) dequeuing is potentially alarming.

Before we go too far down this path, I'd like to see
reproducible numbers that show there is a problem
when a recent upstream NFS server is properly set up
with a sensible number of threads and running a real
workload.

Otherwise there is a risk of introducing code in a
fundamental part of SunRPC that is optimized to the
point of brittleness, and for no good reason.

This is what keeps me from sleeping at night: [1]
See, it even has my name on it. :-)


--
Chuck Lever

[1] - https://www.linusakesson.net/programming/kernighans-lever/index.php



