Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Fri, Aug 4, 2023 at 4:26 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 04, 2023 at 09:33:48AM +0800, Guo Ren wrote:
> > On Thu, Aug 3, 2023 at 7:57 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > > CNA should only show a benefit when there is strong inter-node
> > > contention, and in that case it is typically best to fix the kernel side
> > > locking.
> > >
> > > Hence the question as to what lock prompted you to look at this.
> > I ran into the long lock queue situation when the hardware used an
> > overly aggressive store queue merge buffer delay mechanism. See:
> > https://lore.kernel.org/linux-riscv/20230802164701.192791-8-guoren@xxxxxxxxxx/
>
> *groan*, so you're using it to work around 'broken' hardware :-(
Yes, the hardware needs to be improved and shouldn't depend on a
WRITE_ONCE() hack. But looking at it from another angle: if we could
tell the hardware that a store is a WRITE_ONCE(), i.e. that the store
should become visible to sibling cores immediately, then the hardware
could optimize its behavior in the store queue. (All modern processors
have a store queue beyond the cache, and there is latency between the
store queue and the cache.) So:

How about using an "st.aqrl" instruction for WRITE_ONCE(), which would
make WRITE_ONCE() an RCsc synchronization point?
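
As a rough sketch of the idea (the helper name is made up, and since
today's ISA has no plain store carrying .aqrl bits, the existing
amoswap.d.aqrl from the A extension stands in here for the proposed
"st.aqrl"):

#include <linux/types.h>

/*
 * Illustrative sketch only: publish a 64-bit value with RCsc ordering
 * against both earlier and later accesses, approximating the proposed
 * "st.aqrl" store with an existing RCsc AMO.
 */
static inline void write_once_rcsc64(u64 *p, u64 v)
{
	u64 tmp;

	__asm__ __volatile__ (
		"	amoswap.d.aqrl %0, %2, %1\n"
		: "=r" (tmp), "+A" (*p)
		: "r" (v)
		: "memory");
}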

>
> Wouldn't that hardware have horrifically bad lock throughput anyway?
> Everybody would end up waiting on that store buffer delay.
So far this problem only shows up in the lock torture case, and a
single entry left sitting in the store buffer is enough to trigger it.
We are now running broad stress tests with parallel userspace
applications to look for a second case. Yes, we must treat this
carefully.

>
> > This also let me consider improving the efficiency of the long lock
> > queue release. For example, if the queue is like this:
> >
> > (Node0 cpu0) -> (Node1 cpu64) -> (Node0 cpu1) -> (Node1 cpu65) ->
> > (Node0 cpu2) -> (Node1 cpu66) -> ...
> >
> > Then every mcs_unlock would cause a cross-NUMA transaction. But if we
> > could make the queue like this:
>
> See, this is where the ARM64 WFE would come in handy; I don't suppose
> RISC-V has anything like that?
Em... arm64's smp_cond_load() can only save power or free up pipeline
resources on an SMT processor. When (Node1 cpu64) is sitting in the WFE
state, it still needs (Node0 cpu1) to write the value to send it a
cross-NUMA signal. So I don't see how WFE helps reduce cross-NUMA
transactions; maybe I'm missing something. Sorry.
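
To spell out what I mean, here is a simplified sketch modeled on the
MCS wait/handoff used by the qspinlock slow path (the function and
struct names are shortened for illustration, not the kernel's own):

#include <linux/atomic.h>

struct mcs_node {
	struct mcs_node	*next;
	int		 locked;
};

static void mcs_wait(struct mcs_node *node)
{
	/*
	 * arm64 can back this with WFE; RISC-V just spins.  Either way
	 * the waiter sits on its own node->locked cacheline.
	 */
	smp_cond_load_acquire(&node->locked, VAL);
}

static void mcs_pass(struct mcs_node *next)
{
	/*
	 * This store is the handoff.  If 'next' lives on another NUMA
	 * node, the cacheline transfer crosses the interconnect no
	 * matter how the waiter waited; grouping same-node waiters (as
	 * CNA does) is what cuts down the number of such crossings.
	 */
	smp_store_release(&next->locked, 1);
}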

>
> Also, by the time you have 6 waiters, I'd say the lock is terribly
> contended and you should look at improving the locking scheme.
--
Best Regards
 Guo Ren



