----- On Jul 26, 2021, at 2:49 PM, rostedt rostedt@xxxxxxxxxxx wrote: > On Mon, 26 Jul 2021 13:39:18 -0400 (EDT) > Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote: > >> ----- On Jul 26, 2021, at 12:56 PM, rostedt rostedt@xxxxxxxxxxx wrote: >> >> > On Mon, 26 Jul 2021 11:46:41 -0400 (EDT) >> > Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote: >> [...] >> > [...] >> >> AFAIU, none of the synchronization mechanisms you refer to here (memory barrier, >> IPIs..) will change the fact that this CPU may still be delayed across the >> entire >> 1->0->1 transition sequence, and may end up calling the new callback with the >> old data. Unless an explicit RCU-sync is done. > > OK. I see the issue you are saying. And this came from my assumption > that the tracepoint code did a synchronization when unregistering the > last callback. But of course it wont because that would make a lot of > back to back synchronizations of a large number of tracepoints being > unregistered at once. > Something along the lines of the work approach you propose should work. >> >> > >> >> >> >> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data >> >> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't >> >> guarantee >> >> that the first callback stays in the first position, and we also need to rcu >> >> sync >> >> unconditionally on transition 1->0. We currently only have sync RCU on >> >> transition >> >> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways. >> > >> > Going from 1 to 2, there's no issue. We switch to the iterator, which >> > is the old method anyway. It looks directly at the array and matches >> > the data with the func for each element of that array, and the data >> > read initially (before calling the iterator) is ignored. >> >> This relies on ordering guarantees between RCU assign/dereference and >> static_call >> updates/call. It may well be the case, but I'm asking anyway. >> >> Are we guaranteed of the following ordering ? >> >> CPU A CPU B >> >> static_call_update() > > The static_call_update() triggers an IPI on all CPUs that perform a full memory > barrier. > > That is, nothing on any CPU will cross the static_call_update(). > >> y = rcu_dereference(x) rcu_assign_pointer(x, ...) >> do_static_call(y) >> >> That load of "x" should never happen after the CPU fetches the new static call >> instruction. > > The 'y' will always be the new static call (which is the iterator in > this case), and it doesn't matter which x it read, because the iterator > will read the array just like it was done without static calls. Here by "y" I mean the pointer loaded by rcu_dereference, which is then passed to the call. I do not mean the call per se. I suspect there is something like a control dependency between loading "x" through rcu_dereference and passing the loaded pointer as argument to the static call (and the instruction fetch of the static call). But I don't recall reading any documentation of this specific ordering. > >> >> Also, I suspect that transition 2->1 needs an unconditional rcu-sync because you >> may have a sequence of 3->2->1 (or 1->2->1) where the element 0 data is >> unchanged >> between 2->1, but was changed from 3->2 (or from 1->2), which may be observed by >> the >> static call. > > > I'll agree that we need to add synchronization between 1->0->1, but you > have not convinced me on this second part. In addition to 1->0->1, there are actually 2 other scenarios here: Transition 1->2 to which the ordering question between RCU and static call is related. Transition 2->1 would need an unconditional rcu-sync, because of transitions 3->2->1 and 1->2->1 which can lead the static call to observe wrong data if the rcu_dereference happens when there are e.g. 3 callbacks registered, and then the static call to the function (single callback) is called on the 3-callbacks array. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com