Re: contention on pwq->pool->lock under heavy NFS workload

> On Jun 22, 2023, at 3:39 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
> 
> 
>> On Jun 22, 2023, at 3:23 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> 
>> Hello,
>> 
>> On Thu, Jun 22, 2023 at 03:45:18PM +0000, Chuck Lever III wrote:
>>> The good news:
>>> 
>>> On stock 6.4-rc7:
>>> 
>>> fio 8k [r=108k,w=46.9k IOPS]
>>> 
>>> On the affinity-scopes-v2 branch (with no other tuning):
>>> 
>>> fio 8k [r=130k,w=55.9k IOPS]
>> 
>> Ah, okay, that's probably coming from per-cpu pwq. Didn't expect that to
>> make that much difference but that's nice.
> 
> "cpu" and "smt" work equally well on this system.
> 
> "cache", "numa", and "system" work equally poorly.
> 
> I have HT disabled, and there's only one NUMA node, so
> the difference here is plausible.
> 
> 
>>> The bad news:
>>> 
>>> pool->lock is still the hottest lock on the system during the test.
>>> 
>>> I'll try some of the alternate scope settings this afternoon.
>> 
>> Yeah, in your system, there's still gonna be one pool shared across all
>> CPUs. SMT or CPU may behave better but it might make sense to add a way to
>> further segment the scope so that e.g. one can split a cache domain N-ways.
> 
> If there could be more than one pool to choose from, then these
> WQs would not be hitting the same lock. Alternately, finding a
> lockless way to queue the work on a pool would be a huge win.

Following up with a few more tests.

I'm using NFS/RDMA for my test because I can drive more IOPS with it.

I've found that setting the nfsiod and rpciod workqueues to "cpu"
scope provides the most benefit for this workload. Changing the
xprtiod workqueue to "cpu" scope had no discernible effect.
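
For anyone reproducing this: one way to change a WQ's scope at run
time is the per-workqueue "affinity_scope" sysfs attribute from the
affinity-scopes work; the WQ has to be created with WQ_SYSFS before
it shows up under /sys/devices/virtual/workqueue/. A minimal sketch,
assuming that attribute name and layout:

# Minimal sketch: set the affinity scope of a workqueue via sysfs.
# Assumes the "affinity_scope" attribute from the affinity-scopes
# series and that the WQ was created with WQ_SYSFS (run as root).
from pathlib import Path

WQ_SYSFS = Path("/sys/devices/virtual/workqueue")

def set_affinity_scope(wq: str, scope: str) -> None:
    # scope is one of: "cpu", "smt", "cache", "numa", "system", "default"
    (WQ_SYSFS / wq / "affinity_scope").write_text(scope)

for name in ("rpciod", "nfsiod"):
    set_affinity_scope(name, "cpu")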

This tracks with the number of queue_work calls for each of these
WQs: 59% of the queue_work calls during the test are for the rpciod
WQ, 21% are for nfsiod, and 2% are for xprtiod.
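
A rough sketch of one way to get that per-WQ breakdown, by counting
workqueue:workqueue_queue_work events during the run (assumes tracefs
is mounted at /sys/kernel/tracing and that the event output carries a
"workqueue=<name>" field):

# Rough sketch: count queue_work calls per workqueue by watching the
# workqueue:workqueue_queue_work tracepoint for a while. Assumes
# tracefs at /sys/kernel/tracing and a "workqueue=<name>" field in
# the event output; needs root. The deadline is only checked per
# event, so it expects a steady stream of events (e.g. during fio).
import collections
import re
import time
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")
ENABLE = TRACEFS / "events/workqueue/workqueue_queue_work/enable"

def count_queue_work(seconds=30):
    counts = collections.Counter()
    ENABLE.write_text("1")
    deadline = time.monotonic() + seconds
    try:
        with open(TRACEFS / "trace_pipe") as pipe:
            for line in pipe:
                m = re.search(r"workqueue=(\S+)", line)
                if m:
                    counts[m.group(1)] += 1
                if time.monotonic() > deadline:
                    break
    finally:
        ENABLE.write_text("0")
    return counts

if __name__ == "__main__":
    counts = count_queue_work()
    total = sum(counts.values()) or 1
    for name, n in counts.most_common():
        print(f"{name:<16} {n:>10} ({100 * n / total:.0f}%)")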


The same test with TCP (using IP-over-IB on the same physical network)
shows no improvement in any of these configurations. That suggests
that, when using TCP, throughput is limited by a bottleneck somewhere
other than the pool lock.


--
Chuck Lever
