Re: knfsd performance

> On Jun 18, 2024, at 10:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jun 18, 2024 at 11:26:22PM +0000, Chuck Lever III wrote:
>>> On Jun 18, 2024, at 7:17 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>> Eventually we'd like to make the thread pools dynamic, at which point
>>>> making that the default becomes much simpler from an administrative
>>>> standpoint.
>>> 
>>> I agree that dynamic thread pools will make numa management simpler.
>>> Greg Banks did the numa work for SGI - I wonder where he is now.  He was
>>> at fastmail 10 years ago..
>> 
>> Dave (cc'd) designed it with Greg; Greg implemented it.
> 
> [ I'll dump a bit of history about the NUMA nfsd architecture at the
> end. ]
> 
>>> The idea was to bind network interfaces to numa nodes with interrupt
>>> routing.  There was no expectation that work would be distributed evenly
>>> across all nodes. Some might be dedicated to non-nfs work.  So there was
>>> expected to be non-trivial configuration for both IRQ routing and
>>> threads-per-node.  If we can make threads-per-node demand-based, then
>>> half the problem goes away.
> 
> Right.
> 
> For the dynamic thread pool stuff, the grow side was a simple
> heuristic: when we dequeued a request, we checked whether the
> request queue was empty, whether there were idle nfsd threads, and
> whether we were under the max thread count. In other words, if we
> had more work to do and no idle workers to do it, we forked another
> nfsd thread to do the work.
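
A sketch of that grow heuristic, since it is easier to see in
code. The function and field names below are made up for
illustration -- this is not the actual Irix or Linux
implementation:

/* Grow side: on each dequeue, if there is still work queued,
 * no idle nfsd to pick it up, and headroom under the thread
 * cap, start another nfsd thread. */
static void svc_dequeue_and_maybe_grow(struct svc_serv *serv)
{
	struct svc_rqst *rqstp = svc_dequeue_request(serv);

	if (!svc_queue_empty(serv) &&
	    serv->sv_nr_idle == 0 &&
	    serv->sv_nr_threads < serv->sv_max_threads)
		svc_start_kthread(serv);	/* fork another nfsd */

	svc_process_request(rqstp);	/* handle the one we took */
}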
> 
> I don't recall exactly what Greg implemented on Linux for the shrink
> side. On Irix, the nfsd would record the time at which it completed
> its last request processing, and we fired a timer every 30s or
> so to walk the nfsd status array. If we found an nfsd with a
> completion time older than 30s, the nfsd got reaped. 30s was long
> enough to handle bursty loads, but short enough that people didn't
> complain about having hundreds of nfsds sitting around....
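
The shrink side, as described, would be on the order of this
sketch (timer-driven reaping; again, all of the names here are
mine, not Irix's or Linux's):

#define NFSD_IDLE_GRACE	(30 * HZ)	/* ~30 seconds */

/* Fire every ~30s, walk the nfsd status array, and reap any
 * nfsd whose last completion is older than the grace period. */
static void nfsd_reap_timer_fn(struct timer_list *t)
{
	struct svc_serv *serv = from_timer(serv, t, sv_reap_timer);
	int i;

	for (i = 0; i < serv->sv_max_threads; i++) {
		struct nfsd_status *st = &serv->sv_status[i];

		if (st->active &&
		    time_after(jiffies,
			       st->last_completion + NFSD_IDLE_GRACE))
			svc_stop_kthread(st);	/* reap this nfsd */
	}

	mod_timer(&serv->sv_reap_timer, jiffies + NFSD_IDLE_GRACE);
}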
> 
> This is basically a very simple version of what workqueues do for us
> now.
> 
> That is, if we just base the nfsd request work on per-node,
> node-affine, unbound workqueues, then thread scaling comes along for
> free. I think that workqueues support this per-node thread pool
> affinity natively now:
> 
> enum wq_affn_scope {
>        WQ_AFFN_DFL,                    /* use system default */
>        WQ_AFFN_CPU,                    /* one pod per CPU */
>        WQ_AFFN_SMT,                    /* one pod per SMT */
>        WQ_AFFN_CACHE,                  /* one pod per LLC */
>>>>>>  WQ_AFFN_NUMA,                   /* one pod per NUMA node */
>        WQ_AFFN_SYSTEM,                 /* one pod across the whole system */
> 
>        WQ_AFFN_NR_TYPES,
> };

Note that wq_affn_scope appeared only recently, in kernel
v6.5. Previously there was no way to control workqueue cache
affinity.
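
If NFSD went this way, the setup could be as small as the
following sketch. The workqueue name and the rq_work field are
my inventions; the WQ_* flags and queue_work_node() are the
stock workqueue API:

#include <linux/workqueue.h>

static struct workqueue_struct *nfsd_wq;

/* WQ_UNBOUND selects the per-pod worker pools; WQ_SYSFS
 * exposes /sys/devices/virtual/workqueue/knfsd/affinity_scope,
 * which accepts "cpu", "smt", "cache", "numa", or "system". */
static int nfsd_wq_create(void)
{
	nfsd_wq = alloc_workqueue("knfsd", WQ_UNBOUND | WQ_SYSFS, 0);
	return nfsd_wq ? 0 : -ENOMEM;
}

/* Queue request processing on the NUMA node that the arriving
 * transport's device is affine to. */
static void nfsd_queue_rqst(struct svc_rqst *rqstp, int node)
{
	queue_work_node(node, nfsd_wq, &rqstp->rq_work);
}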

I have observed significant improvement in throughput
scalability on the Linux NFS client with the new default
workqueue cache affinity behavior (with NFS/RDMA, of
course). The issue is, interestingly enough, the same
as it was for lwq: spinlock contention is reduced once
the set of threads being scheduled all belong to the
same hardware cache.


> I'm not sure that the NFS server needs to reinvent the wheel here...

Jeff implemented a workqueue-based svc thread scheduler
about 10 years ago. We could never make it perform as
well as the existing kthreads-based scheduler. I'm not
sure we need a work-conserving thread scheduler for
NFSD/SunRPC.

That's why we took the route of keeping the bespoke
thread scheduler in SunRPC this time.


>> Network devices (and storage devices) are affined to one
>> NUMA node.
> 
> NVMe storage devices don't need to be affine to the node. They just
> need to have a hardware queue assigned to each node so that
> node-local IO always hits the same hardware queue and gets
> completion interrupts returned to that same node.
> 
> And, yes, this is something that still has to be configured
> manually, too.
> 
>> If the nfsd threads are not on the same node
>> as the network device, there is a significant penalty.
>> 
>> I have a two-node system here, and it performs consistently
>> well when I put it in pool_mode=numa and affine the network
>> device's IRQs to one node.
>> 
>> It even works with two network devices (one per node) --
>> each device gets its own set of nfsd threads.
> 
> Right. But this is all orthogonal to solving the problem of demand
> based thread pool scaling.

True, but it /is/ related to Trond's initial observation.
The network devices need to be affined properly before
pool_mode=numa is completely effective.


>> I don't think the pool_mode needs to be demand based. If
>> the system is a NUMA system, it makes sense to split up
>> the thread pools and put our pencils down. The only other
>> step that is needed is proper IRQ affinity settings for
>> the network devices.
> 
> I think it's better for everyone if the system automatically scales
> with demand, regardless of whether it's a NUMA system or not, and
> regardless of whether the proper NUMA affinity has been configured
> or not.

Read the mistakes that I wrote here ;-) I wasn't referring
to thread count. As I responded later, I think automatically
managing the pool thread count would be a good improvement.
We've been discussing this off and on for a while.


>>> We could even default to one-thread-pool-per-CPU if there are more than
>>> X cpus....
>> 
>> I've never seen a performance improvement in the per-cpu
>> pool mode, fwiw.
> 
> We're not doing anything inherently per-cpu in processing an NFS
> request, so I can only see downsides to trying to restrict incoming
> processing to per-cpu queues. i.e. if the CPU can't handle all the
> incoming requests, what processes the per-cpu request backlog? At
> least with per-node queues, we have many cpus to throw at the one
> incoming request queue and we are much less likely to get backlogged
> and starve the request queue of processing resources...

I've proposed getting rid of pool_mode=cpu, because I can't
find a winning use case for it. We erred on the conservative
side and left it in place for now.

In terms of making pool_mode=numa the default setting, that
will work in many cases, but in some cases it will result in
unpleasant surprises.

The default out-of-the-shrink-wrap thread count is currently
fixed at 8.

In pool_mode=global, all 8 threads get used.

In pool_mode=numa, the total thread count is split across
the nodes. On a single node system, there will be no
change. All is well.

On a two-node system, though, each pool will get 4 threads,
or only 2 threads each on a four-node system. That can have
a surprising and undesirable performance impact.

I don't know what will happen on a system with more nodes
than nfsd threads.
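
To put numbers on that, here is the arithmetic as an
illustrative helper (not the actual SunRPC pool code, which may
distribute any remainder differently):

/* Naive even split of the default thread count across pools. */
static unsigned int threads_per_pool(unsigned int total,
				     unsigned int npools)
{
	return total / npools;	/* 8/1 = 8, 8/2 = 4, 8/4 = 2 */
}

/* With more nodes than threads, integer division goes to zero:
 * threads_per_pool(8, 16) == 0, so some pools could end up with
 * no nfsd threads at all unless the implementation rounds up. */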

Thus I think there needs to be a solution (like dynamic
thread count adjustment) in place to address thread count
splitting before we can change the default pool mode.


--
Chuck Lever
