On 2022-12-20 19:07, Frederic Weisbecker wrote:
On Tue, Dec 20, 2022 at 12:00:58PM -0500, Mathieu Desnoyers wrote:
On 2022-12-19 20:04, Joel Fernandes wrote:
The main benefit I expect is improved performance of the grace period
implementation in common cases where there are few or no readers present,
especially on machines with many CPUs.
It allows scanning both periods (0/1) for each CPU within the same pass,
therefore loading both periods' unlock counters sitting in the same cache
line at once (improved locality), and then loading both periods' lock
counters, also sitting in the same cache line.
It also allows skipping the period flip entirely if there are no readers
present, which is an -arguably- tiny performance improvement as well.
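
To make the single-pass idea concrete, here is a minimal userspace sketch
in C. It is not the kernel's SRCU code: the struct layout, the names
(percpu_counters, scan_both_periods, can_skip_flip) and the fence
placement are all assumptions made for illustration, and it makes no claim
about the barrier pairing real SRCU needs. It only shows both periods'
unlock counters being summed in one pass, then both periods' lock
counters, with the flip skipped when neither period has readers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU counters: both periods' unlock counters sit next to
 * each other, as do both periods' lock counters (assumed layout). */
struct percpu_counters {
	unsigned long unlock_count[2];
	unsigned long lock_count[2];
};

/* Result of one scan: quiescent[idx] is true when no reader is seen
 * in period idx. */
struct scan_result {
	bool quiescent[2];
};

/* Scan both periods for every CPU: unlock counters first, then lock
 * counters, so each pair of counters is loaded together. */
static struct scan_result scan_both_periods(const struct percpu_counters *cpus,
					    size_t nr_cpus)
{
	unsigned long unlocks[2] = { 0, 0 };
	unsigned long locks[2] = { 0, 0 };
	struct scan_result res;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		unlocks[0] += cpus[i].unlock_count[0];
		unlocks[1] += cpus[i].unlock_count[1];
	}

	/* Illustrative full fence between the two passes (assumption). */
	atomic_thread_fence(memory_order_seq_cst);

	for (i = 0; i < nr_cpus; i++) {
		locks[0] += cpus[i].lock_count[0];
		locks[1] += cpus[i].lock_count[1];
	}

	res.quiescent[0] = (locks[0] == unlocks[0]);
	res.quiescent[1] = (locks[1] == unlocks[1]);
	return res;
}

/* Fast path: with no readers in either period, the flip can be skipped. */
static bool can_skip_flip(const struct scan_result *res)
{
	return res->quiescent[0] && res->quiescent[1];
}
```

When can_skip_flip() returns false, a caller would fall back to the
existing flip-and-wait slow path, which this sketch does not model.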
I would indeed expect a performance improvement if there are no readers in the
active period/idx, but if there are, it's a performance penalty due to the extra
scans.
So my main questions are:
* Is the no-readers-present case the most likely one? I guess it depends on the ssp.
* Does the SRCU update side deserve to be optimized with added code? (Because
we are not debating about removing the flip, but rather about adding a fast-path
and keeping the flip as a slow-path.)
* The SRCU machinery is already quite complicated. Look how we little things lock
ourselves in for days doing our exegesis of the SRCU state machine. And halfway
through it we are still debating some ordering. Is it worth adding a new path there?
I'm not arguing for making things more complex unless there are good
reasons to do so. However, I think we badly need to improve the
documentation of the memory barriers in SRCU, because the claimed
barrier pairing is odd.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com