Re: [RFC 0/2] srcu: Remove pre-flip memory barrier

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Tue, 20 Dec 2022 13:29:02 -0500

> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> 
> On 2022-12-20 13:05, Joel Fernandes wrote:
>> Hi Mathieu,
>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers
>>> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>> 
>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>> [...]
>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this needs to
>>>>>> be exactly 2^64 read-side critical sections.
>>>>> 
>>>>> Yes, but what about 32-bit systems?
>>> 
>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>> The question we need to ask is therefore: if 2^32 is good enough for
>>> seqlock, why isn't it good enough for SRCU ?
>> I think Paul said wrap around does happen with SRCU on 32-bit but I'll
>> let him talk more about it. If 32-bit is good enough, let us also drop
>> the size of the counters for 64-bit then?
>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>> counter even on 64-bit architectures).
>>>>> 
>>>>> The seqlock is an interesting point.
>>>>> 
>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>> 
>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>> active readers? Is the issue about grace period delay or performance?
>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>> rcutorture/rcuscale. If there is significant benefit to current
>>>>> approach, then IMO it is worth exploring.
>>> 
>>> The main benefit I expect is improved performance of the grace period
>>> implementation in common cases where there are few or no readers
>>> present, especially on machines with many cpus.
>>> 
>>> It allows scanning both periods (0/1) for each cpu within the same pass,
>>> therefore loading both periods' unlock counters sitting in the same
>>> cache line at once (improved locality), and then loading both periods'
>>> lock counters, also sitting in the same cache line.
>>> 
>>> It also allows skipping the period flip entirely if there are no readers
>>> present, which is an -arguably- tiny performance improvement as well.
>> The issue of counter wrap aside, what if a new reader always shows up
>> in the active index being scanned, then can you not delay the GP
>> indefinitely? It seems like writer-starvation is possible then (sure
>> it is possible also with preemption after reader-index-sampling, but
>> scanning active index deliberately will make that worse). Seqlock does
>> not have such writer starvation just because the writer does not care
>> about what the readers are doing.
> 
> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happen to be no readers in either period.
> 
> If there are readers in the "non-current" period, the grace period waits for them.
> 
> If there are readers in the "current" period, it flips the period and then waits for them.

OK, glad you already do that. This is what I was hinting at in my previous email as well, that is, a hybrid approach. Sorry, I did not know the details of your side-RCU well enough to realize you were already doing something like that.
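
For anyone following along, the three cases above could be sketched roughly as below. This is only my reading of the description, not side-rcu code: NR_CPUS, the struct layout, and every name here are hypothetical, and the memory barriers a real implementation needs are reduced to a comment. It only decides what the grace period would have to do after the first pass; it does not wait.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count, for illustration only */

/* Hypothetical per-CPU layout: both periods' counters sit side by side. */
struct percpu_count {
	unsigned long lock[2];		/* read-side entries, per period */
	unsigned long unlock[2];	/* read-side exits, per period */
};

struct side_domain {
	struct percpu_count cpu[NR_CPUS];
	int period;			/* current period, 0 or 1 */
};

/* What the grace period still has to do after the first pass. */
struct gp_plan {
	bool wait_noncurrent;	/* wait for readers of the non-current period */
	bool flip;		/* flip the period */
	bool wait_current;	/* after the flip, wait for the old current period */
};

/* Single pass over all CPUs, sampling both periods together. */
static void scan_both(const struct side_domain *d, bool active[2])
{
	unsigned long locks[2] = { 0, 0 }, unlocks[2] = { 0, 0 };
	int cpu;

	/* Sum the unlock counts first, for both periods at once. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		unlocks[0] += d->cpu[cpu].unlock[0];
		unlocks[1] += d->cpu[cpu].unlock[1];
	}
	/* A real implementation needs a full memory barrier here. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		locks[0] += d->cpu[cpu].lock[0];
		locks[1] += d->cpu[cpu].lock[1];
	}
	/*
	 * Unsigned sums wrap, so an imbalance is missed only if the two
	 * sums differ by an exact multiple of the counter range.
	 */
	active[0] = locks[0] != unlocks[0];
	active[1] = locks[1] != unlocks[1];
}

/* Map the first-pass result onto the three cases described above. */
struct gp_plan plan_grace_period(const struct side_domain *d)
{
	struct gp_plan plan = { false, false, false };
	bool active[2];

	assert(d->period == 0 || d->period == 1);
	scan_both(d, active);

	/* Readers in the non-current period: wait for them. */
	plan.wait_noncurrent = active[d->period ^ 1];
	/* Readers in the current period: flip, then wait for them. */
	plan.flip = active[d->period];
	plan.wait_current = active[d->period];
	/* No readers in either period: all three stay false, no flip. */
	return plan;
}
```

With no readers at all the plan comes back empty, which is the opportunistic "skip the flip" case. A reader that shows up in the current period only forces a flip, so it cannot hold off the grace period indefinitely.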

> 
>> That said, the approach of scanning both counters does seem attractive
>> for when there are no readers, for the reasons you mentioned. Maybe a
>> heuristic to count the number of readers might help? If we are not
>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>> and also couple that heuristic with the number of CPUs. I am
>> interested in working on such a design with you! Let us do it and
>> prototype/measure. ;-)
> 
> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.

Makes sense. I think you already implement a form of heuristic indirectly, by flipping only when the scan of both periods finds readers.

> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited though, as I'm currently supposed to be on vacation. :)

Oh, I was referring more to after the holidays. I am also starting vacation soon and am limited in cycles ;-). It is probably better to enjoy the holidays and come back to this afterwards.

I do want to finish my memory barrier studies of SRCU over the holidays, since I am already deep in the hole with that. I am back on the post-flip memory barrier now, since I think even that might not be needed…

Cheers,

 - Joel

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>