On Tue, Feb 04, 2025 at 08:28:29PM -0500, Joel Fernandes wrote: > On Tue, Feb 4, 2025 at 3:22 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > > > On Tue, Feb 04, 2025 at 10:44:45AM -0500, Joel Fernandes wrote: > > > On Thu, Jan 30, 2025 at 12:56 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > > > By "where we were initially", one might reasonably guess that you meant > > > > that the initial value and the target are one and the same. But I suspect > > > > that you are thinking of a third value, the value of the underlying > > > > grace-period sequence number at the time of the call to rcu_seq_snap(). > > > > But you might be thinking of something else entirely. > > > > > > > > > Now we move rcu_seq_done_exact. It has the exact same behavior except > > > > > that instead of ULONG_MAX/2, the above situations are shrunk to about > > > > > 10 counts from the target. So if target is 28, then the initial > > > > > sequence should have been at least 18 to avoid false-positive, but > > > > > again it is possible and I certain see in the code that it cannot be > > > > > used this way.. (at least so far).. So all we are left with is the > > > > > false-negative band of ~2.5 GPs.. > > > > > > > > Here, I have the same questions. As you no doubt guessed. > > > > > > > > > About "what are the consequences of failing to get this right". I > > > > > believe false-positive is unacceptable unless it is maybe debug code - > > > > > that can break functionality in code, too short GPs and all that. > > > > > However false-negative is OK as long as the usecase can retry later > > > > > and can afford to wait. Did I get that much correct? > > > > > > > > Maybe? > > > > > > > > Please look at this on a per-use-case basis. > > > > > > Sure. FWIW, I started a world-editable document here. I am planning to > > > work on this a bit more before our meeting this week. If others > > > knowledgeable like Frederic and others could make edits/comments, > > > that'd be welcomed. But I basically summarized this whole thread here: > > > https://docs.google.com/document/d/1sFNuGH4U6DFCE8sunbdWMjFhJhb1aG4goOJAnS6c6gQ/edit?usp=sharing > > > > > > My thought is (AFAICS), this patch is still valid and > > > rcu_seq_done_exact is a potentially better replacement to > > > rcu_seq_done. But famous last words... > > > > Let's start with the use case where the sequence number is being used > > by rcu_barrier(). What happens? > > The rcu_state.barrier_sequence is used to keep track of barrier in > progress, as there can be concurrent rcu_barrier() calls happening so > we optimize by batching multiples of these calls. > > It also handles the case where the entrain IPI came through but no > rcu_barrier() is currently in-flight (which would be weird but maybe > it happens somehow). > > It is also used to make sure we don't need to entrain on a CPU again > if we already did an entraining for a certain rcu_barrier() invocation > already. This is tracked by rdp->barrier_seq_snap which copies the > value of rcu_state.barrier_sequence after the entraining. Al though I > am currently unable to envision why the IPI handler would be called > unnecessarily (hardware bug?) since we hold the rcu_barrier mutex > throughout the rcu_barrier() function, including while waiting on > CPUs. > > Going back to my first point, for the rcu_seq_done() in rcu_barrier(), > we are checking if someone else did our work: > mutex_lock(&rcu_state.barrier_mutex); > /* Did someone else do our work for us? */ > if (rcu_seq_done(&rcu_state.barrier_sequence, s)) > > Here I think rcu_seq_done_exact() is better, because if too many > rcu_barrier() calls concurrently made at the same time, the mutex may > be contended. If enough time passes and we have a wrap around, we > might end up in false-negative land which is much larger with > rcu_seq_done() than ..done_exact(). However, that does seem to be a > bug to begin with if the mutex contended for that long. OTOH, doing an > rcu_barrier() again when we didn't need to doesn't seem like the end > of the world especially if we contended on the mutex for that long. > But at least it shows that rcu_seq_done_exact() cannot seem to hurt > here. > > Hmmm, So... Do we really care about the difference between rcu_seq_done() and rcu_seq_done_exact() in the rcu_barrier() case? Thanx, Paul