On Fri, Jan 24, 2025 at 05:42:29PM +0100, Frederic Weisbecker wrote:
> On Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > > > functions use the root rcu_node structure's ->gp_seq field to detect
> > > > the beginnings and ends of grace periods, respectively. This choice is
> > > > necessary for the poll_state_synchronize_rcu_full() function because
> > > > (give or take counter wrap) the following sequence is guaranteed not
> > > > to trigger:
> > > >
> > > >         get_state_synchronize_rcu_full(&rgos);
> > > >         synchronize_rcu();
> > > >         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> > > >
> > > > The RCU callbacks that awaken synchronize_rcu() instances are
> > > > guaranteed not to be invoked before the root rcu_node structure's
> > > > ->gp_seq field is updated to indicate the end of the grace period.
> > > > However, these callbacks might start being invoked immediately
> > > > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > > > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > > > root rcu_node structure's ->gp_seq field. Because this field is
> > > > updated under this structure's ->lock, any code following a call to
> > > > poll_state_synchronize_rcu_full() will be fully ordered after the
> > > > full grace-period computation, as is required by RCU's memory-ordering
> > > > semantics.
> > > >
> > > > By symmetry, the get_state_synchronize_rcu_full() function should also
> > > > use this same root rcu_node structure's ->gp_seq field. But it turns
> > > > out that symmetry is profoundly (though extremely infrequently)
> > > > destructive in this case. To see this, consider the following sequence
> > > > of events:
> > > >
> > > > 1.  CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> > > >     accordingly.
> > > >
> > > > 2.  As its first step of grace-period initialization, CPU 0 examines
> > > >     the current CPU hotplug state and decides that it need not wait
> > > >     for CPU 1, which is currently offline.
> > > >
> > > > 3.  CPU 1 comes online, and updates its state. This does not affect
> > > >     the current grace period, but rather the one after that. After
> > > >     all, CPU 1 was offline when the current grace period started, so
> > > >     all pre-existing RCU readers on CPU 1 must have completed or been
> > > >     preempted before it last went offline. The current grace period
> > > >     therefore has nothing it needs to wait for on CPU 1.
> > > >
> > > > 4.  CPU 1 switches to an rcutorture kthread which is running
> > > >     rcutorture's rcu_torture_reader() function, which starts a new
> > > >     RCU reader.
> > > >
> > > > 5.  CPU 2 is running rcutorture's rcu_torture_writer() function
> > > >     and collects a new polled grace-period "cookie" using
> > > >     get_state_synchronize_rcu_full(). Because the newly started
> > > >     grace period has not completed initialization, the root rcu_node
> > > >     structure's ->gp_seq field has not yet been updated to indicate
> > > >     that this new grace period has already started.
> > > >
> > > >     This cookie is therefore set up for the end of the current grace
> > > >     period (rather than the end of the following grace period).
> > > >
> > > > 6.  CPU 0 finishes grace-period initialization.
> > > >
> > > > 7.  If CPU 1's rcutorture reader is preempted, it will be added to
> > > >     the ->blkd_tasks list, but because CPU 1's ->qsmask bit is not
> > > >     set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> > > >     will not be updated. Thus, this grace period will not wait on
> > > >     it, which is only fair, given that the CPU did not come online
> > > >     until after the grace period officially started.
> > > >
> > > > 8.  CPUs 0 and 2 then detect the new grace period and report
> > > >     a quiescent state to the RCU core.
> > > >
> > > > 9.  Because CPU 1 was offline at the start of the current grace
> > > >     period, CPUs 0 and 2 are the only CPUs that this grace period
> > > >     needs to wait on. So the grace period ends and post-grace-period
> > > >     cleanup starts. In particular, the root rcu_node structure's
> > > >     ->gp_seq field is updated to indicate that this grace period
> > > >     has now ended.
> > > >
> > > > 10. CPU 2 continues running rcu_torture_writer() and sees that,
> > > >     from the viewpoint of the root rcu_node structure consulted by
> > > >     the poll_state_synchronize_rcu_full() function, the grace period
> > > >     has ended. It therefore updates state accordingly.
> > > >
> > > > 11. CPU 1 is still running the same RCU reader, which notices this
> > > >     update and thus complains about the too-short grace period.
> > >
> > > I think I get the race, but I must confess I'm not very familiar with
> > > how this all materializes in CPU 2's rcu_torture_writer() and CPU 1's
> > > rcu_torture_reader().
> > >
> > > Basically this could trigger on CPU 1 just by doing the following, right?
> > >
> > >         rcu_read_lock()
> > >         get_state_synchronize_rcu_full(&rgos);
> > >         WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> > >         rcu_read_unlock()
> >
> > CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
> > reader, and CPU 2 would do get_state_synchronize_rcu_full() and later
> > poll_state_synchronize_rcu_full(), which would (erroneously) indicate
> > a completed grace period, so it would update the state, triggering CPU
> > 1's checks.
>
> I see, so if I generalize the problem beyond rcutorture, this looks like
> this, right?
>
> CPU 0                                       CPU 1                     CPU 2
> -----                                       -----                     -----
>
> rcu_seq_start(rcu_state.gp_seq)
>                                             // CPU boots
>                                             rcu_read_lock()
>                                             r0 = rcu_dereference(*X)
>                                                                       r1 = *X
>                                                                       *X = NULL
>                                                                       // relies on rnp->gp_seq and not rcu_state.gp_seq
>                                                                       get_state_synchronize_rcu_full(&rgos)
> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
>                                                                       rcu_report_qs_rdp()
> rcu_report_qs_rdp()
> rcu_report_qs_rnp()
> rcu_report_qs_rsp()
>                                                                       if (poll_state_synchronize_rcu_full(&rgos))
>                                                                           kfree(r1)
>                                             // play with r0 and crash

That would do it!

> > > I'm wondering, what prevents us from removing rcu_state.gp_seq and
> > > relying only on the root node for the global state?
> >
> > One scenario comes to mind immediately. There may be others.
> >
> > Suppose we were running with default configuration on a system with
> > "only" eight CPUs. Then there is only the one rcu_node structure, which
> > is both root and leaf. Without rcu_state.gp_seq, there would be no way
> > to communicate the beginning of the grace period to
> > get_state_synchronize_rcu_full() without also allowing quiescent states
> > to be reported. There would thus be no time in which to check for newly
> > onlined/offlined CPUs.
>
> Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
> could be done before rcu_seq_start()?
If we do that, aren't we vulnerable to a CPU coming online just after we
handled qsmaskinit{,next} and just before we do rcu_seq_start()?

                                                        Thanx, Paul
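
To make the ordering discussed above concrete, here is a minimal sketch, in
kernel-style C, of the sequence the thread keeps referring to: rcu_state.gp_seq
moves at step 1, the CPU hotplug state is scanned at step 2, and only at
step 6 does the root rcu_node structure's ->gp_seq advertise the new grace
period. The helper names rcu_seq_start(), rcu_seq_snap(), rcu_seq_done_exact(),
and rcu_get_root() follow the mainline sources, but the bodies below are
illustrative only; they elide locking, the expedited-grace-period counter,
and most memory ordering, and the sketch_*() functions are hypothetical names
for this note, not kernel APIs.

        /* Hypothetical sketch only -- not the actual kernel implementation. */

        /* Placeholder for the ->qsmaskinit/->qsmaskinitnext scan (step 2). */
        static void sketch_scan_hotplug_state(void);

        /* Grace-period initialization, reduced to steps 1, 2, and 6 above. */
        static void sketch_gp_init(void)
        {
                /* Step 1: the grace period officially starts. */
                rcu_seq_start(&rcu_state.gp_seq);

                /* Step 2: decide which CPUs this grace period must wait on. */
                sketch_scan_hotplug_state();

                /* Step 6: the root rcu_node structure catches up. */
                WRITE_ONCE(rcu_get_root()->gp_seq, rcu_state.gp_seq);
        }

        /*
         * The asymmetry the patch argues for: generate cookies from the early
         * counter (rcu_state.gp_seq), but check them against the late counter
         * (the root rcu_node structure's ->gp_seq).
         */
        static void sketch_get_state_full(struct rcu_gp_oldstate *rgosp)
        {
                smp_mb();  /* Prior accesses happen before the counter sample. */
                rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
        }

        static bool sketch_poll_state_full(struct rcu_gp_oldstate *rgosp)
        {
                smp_mb();  /* Order against grace-period cleanup. */
                return rcu_seq_done_exact(&rcu_get_root()->gp_seq, rgosp->rgos_norm);
        }

Against this sketch, the cookie that CPU 2 collects in step 5 is taken after
step 1 but before step 6. Snapping the early counter therefore yields a cookie
that cannot be satisfied until the grace period after the one still
initializing, so step 10 can no longer declare that grace period complete
while CPU 1's reader is still running, whereas snapping the root rcu_node
structure's ->gp_seq (as the pre-patch code does) yields the too-early cookie
described in step 5.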