On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote: > Note: A previous approach by PeterZ [1] used an extra bit in > context_tracking.state to flag the presence of deferred callbacks to > execute, and the actual callbacks were stored in a separate atomic > variable. > > This meant that the atomic read of context_tracking.state was sufficient to > determine whether there are any deferred callbacks to execute. > Unfortunately, it presents a race window. Consider the work setting > function as: > > preempt_disable(); > seq = atomic_read(&ct->seq); > if (__context_tracking_seq_in_user(seq)) { > /* ctrl-dep */ > atomic_or(work, &ct->work); > ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK); > } > preempt_enable(); > > return ret; > > Then the following can happen: > > CPUx CPUy > CT_SEQ_WORK \in context_tracking.state > atomic_or(WORK_N, &ct->work); > ct_kernel_enter() > ct_state_inc(); > atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK); > > The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be > sent. Unfortunately, the work bit would remain set, and it can't be sanely > cleared in case another CPU set it concurrently - this would ultimately > lead to a double execution of the callback, one as a deferred callback and > one in the IPI. As not all IPI callbacks are idempotent, this is > undesirable. So adding another atomic is arguably worse. The thing is, if the NOHZ_FULL CPU is actually doing context transitions (SYSCALLs etc..) then everything is fundamentally racy, there is no winning that game, we could find the remote CPU is in-kernel, send an IPI, the remote CPU does return-to-user and receives the IPI. And then the USER is upset... because he got an IPI. The whole NOHZ_FULL thing really only works if userspace does not do SYSCALLs. But the sad sad state of affairs is that some people think it is acceptable to do SYSCALLs while NOHZ_FULL and cry about how slow stuff is.