On Wed, Feb 5, 2025 at 5:28 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Tue, Feb 04, 2025 at 08:28:29PM -0500, Joel Fernandes wrote:
> > On Tue, Feb 4, 2025 at 3:22 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Feb 04, 2025 at 10:44:45AM -0500, Joel Fernandes wrote:
> > > > On Thu, Jan 30, 2025 at 12:56 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > By "where we were initially", one might reasonably guess that you meant
> > > > > that the initial value and the target are one and the same. But I suspect
> > > > > that you are thinking of a third value, the value of the underlying
> > > > > grace-period sequence number at the time of the call to rcu_seq_snap().
> > > > > But you might be thinking of something else entirely.
> > > > >
> > > > > > Now we move to rcu_seq_done_exact. It has the exact same behavior except
> > > > > > that instead of ULONG_MAX/2, the above situations are shrunk to about
> > > > > > 10 counts from the target. So if the target is 28, then the initial
> > > > > > sequence should have been at least 18 to avoid a false positive, but,
> > > > > > again, it is possible, and I certainly see in the code that it cannot be
> > > > > > used this way (at least so far). So all we are left with is the
> > > > > > false-negative band of ~2.5 GPs.
> > > > >
> > > > > Here, I have the same questions. As you no doubt guessed.
> > > > >
> > > > > > About "what are the consequences of failing to get this right": I
> > > > > > believe a false positive is unacceptable unless it is maybe debug code -
> > > > > > it can break functionality, too-short GPs and all that.
> > > > > > However, a false negative is OK as long as the use case can retry later
> > > > > > and can afford to wait. Did I get that much correct?
> > > > >
> > > > > Maybe?
> > > > >
> > > > > Please look at this on a per-use-case basis.
> > > >
> > > > Sure. FWIW, I started a world-editable document here.
> > > > I am planning to work on this a bit more before our meeting this
> > > > week. If others knowledgeable like Frederic could make
> > > > edits/comments, that'd be welcomed. But I basically summarized this
> > > > whole thread here:
> > > > https://docs.google.com/document/d/1sFNuGH4U6DFCE8sunbdWMjFhJhb1aG4goOJAnS6c6gQ/edit?usp=sharing
> > > >
> > > > My thought is (AFAICS), this patch is still valid and
> > > > rcu_seq_done_exact is a potentially better replacement for
> > > > rcu_seq_done. But famous last words...
> > >
> > > Let's start with the use case where the sequence number is being used
> > > by rcu_barrier(). What happens?
> >
> > The rcu_state.barrier_sequence is used to keep track of a barrier in
> > progress; since there can be concurrent rcu_barrier() calls, we
> > optimize by batching multiples of these calls.
> >
> > It also handles the case where the entrain IPI came through but no
> > rcu_barrier() is currently in flight (which would be weird but maybe
> > it happens somehow).
> >
> > It is also used to make sure we don't need to entrain on a CPU again
> > if we already did an entraining for a certain rcu_barrier()
> > invocation. This is tracked by rdp->barrier_seq_snap, which copies the
> > value of rcu_state.barrier_sequence after the entraining. Although I
> > am currently unable to envision why the IPI handler would be called
> > unnecessarily (hardware bug?) since we hold the rcu_barrier mutex
> > throughout the rcu_barrier() function, including while waiting on
> > CPUs.
> >
> > Going back to my first point, for the rcu_seq_done() in rcu_barrier(),
> > we are checking if someone else did our work:
> >
> >     mutex_lock(&rcu_state.barrier_mutex);
> >     /* Did someone else do our work for us? */
> >     if (rcu_seq_done(&rcu_state.barrier_sequence, s))
> >
> > Here I think rcu_seq_done_exact() is better, because if too many
> > concurrent rcu_barrier() calls are made at the same time, the mutex
> > may be contended.
> > If enough time passes and we have a wraparound, we might end up in
> > false-negative land, which is much larger with rcu_seq_done() than
> > with rcu_seq_done_exact(). However, that does seem to be a bug to
> > begin with if the mutex is contended for that long. OTOH, doing an
> > rcu_barrier() again when we didn't need to doesn't seem like the end
> > of the world, especially if we contended on the mutex for that long.
> > But at least it shows that rcu_seq_done_exact() cannot seem to hurt
> > here.
> >
> > Hmmm,
>
> So... Do we really care about the difference between rcu_seq_done()
> and rcu_seq_done_exact() in the rcu_barrier() case?

Insofar as it would let us get rid of the rcu_seq_done() API in favor of
rcu_seq_done_exact(), yes; but it should not cause any functional or
robustness changes to rcu_barrier() at least, AFAICS.

thanks,

- Joel