Hi Paul,

Answering a question from Peter on IRC got me to look at
rcu_read_lock_trace(), and I see this:

static inline void rcu_read_lock_trace(void)
{
	struct task_struct *t = current;

	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
	barrier();
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
	    t->trc_reader_special.b.need_mb)
		smp_mb(); // Pairs with update-side barriers
	rcu_lock_acquire(&rcu_trace_lock_map);
}

static inline void rcu_read_unlock_trace(void)
{
	int nesting;
	struct task_struct *t = current;

	rcu_lock_release(&rcu_trace_lock_map);
	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
	barrier(); // Critical section before disabling.
	// Disable IPI-based setting of .need_qs.
	WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
		WRITE_ONCE(t->trc_reader_nesting, nesting);
		return;  // We assume shallow reader nesting.
	}
	rcu_read_unlock_trace_special(t, nesting);
}

AFAIU, each thread keeps track of whether it is nested within an RCU
read-side critical section with a counter, and the grace period iterates
over all threads to make sure none of them is within a read-side critical
section before it can complete:

# define rcu_tasks_trace_qs(t)						\
	do {								\
		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
			smp_store_release(&(t)->trc_reader_checked, true); \
			smp_mb(); /* Readers partitioned by store. */	\
		}							\
	} while (0)

It reminds me of the liburcu urcu-mb flavor, which also uses per-thread
state to track whether threads are nested within a critical section:

https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90
https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125

static inline void _urcu_mb_read_lock_update(unsigned long tmp)
{
	if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) {
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
				  _CMM_LOAD_SHARED(urcu_mb_gp.ctr));
		cmm_smp_mb();
	} else
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, tmp + URCU_GP_COUNT);
}

static inline void _urcu_mb_read_lock(void)
{
	unsigned long tmp;

	urcu_assert(URCU_TLS(urcu_mb_reader).registered);
	cmm_barrier();
	tmp = URCU_TLS(urcu_mb_reader).ctr;
	urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK);
	_urcu_mb_read_lock_update(tmp);
}

The main difference between the two algorithms is that tasks-trace RCU
within the kernel lacks the per-reader snapshot of a global state
("urcu_mb_gp.ctr"), which is either incremented or flipped between 0 and 1
by the grace period. That snapshot allows RCU readers whose outermost
nesting starts after the beginning of the grace period not to prevent the
grace period from making progress. Without this, a steady flow of incoming
tasks-trace-RCU readers can prevent the grace period from ever completing.

Or is this handled in a clever way that I am missing here?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
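
P.S. To make the above concrete, here is a heavily simplified sketch of the
snapshot mechanism I have in mind. All names (gp_ctr, reader_ctr,
sketch_read_lock, ...) are invented for illustration; this is neither the
actual liburcu code (which registers readers and runs a two-phase
handshake) nor anything present in the kernel. The point is only that the
outermost read-lock copies the global grace-period phase into per-thread
state, and the updater waits only for readers whose snapshot still carries
the old phase, so readers arriving after the flip cannot stall it.

/*
 * Hypothetical sketch only: invented names, heavy-handed ordering,
 * reader registration/iteration and the two-phase handshake omitted.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define NEST_MASK	0x0000ffffUL	/* low bits: reader nesting count */
#define PHASE_BIT	0x00010000UL	/* grace-period phase bit         */

static _Atomic unsigned long gp_ctr;			/* global phase        */
static _Thread_local _Atomic unsigned long reader_ctr;	/* per-thread snapshot */

static inline void sketch_read_lock(void)
{
	unsigned long tmp = atomic_load_explicit(&reader_ctr, memory_order_relaxed);

	if (!(tmp & NEST_MASK)) {
		/* Outermost nesting: snapshot the current global phase. */
		unsigned long gp = atomic_load_explicit(&gp_ctr, memory_order_relaxed);

		atomic_store_explicit(&reader_ctr, gp + 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);
	} else {
		/* Nested: only the owning thread writes reader_ctr, so a
		 * plain store of the bumped count is sufficient. */
		atomic_store_explicit(&reader_ctr, tmp + 1, memory_order_relaxed);
	}
}

static inline void sketch_read_unlock(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_sub_explicit(&reader_ctr, 1, memory_order_relaxed);
}

/*
 * Grace-period side: flip the global phase, then wait, for each reader
 * thread, until this predicate becomes false.  A reader taking its
 * outermost lock after the flip copies the new phase and is ignored,
 * so a steady stream of new readers cannot prevent completion.
 */
static inline bool reader_blocks_gp(unsigned long reader_snapshot)
{
	unsigned long gp = atomic_load_explicit(&gp_ctr, memory_order_relaxed);

	return (reader_snapshot & NEST_MASK) &&			   /* in a critical section */
	       (reader_snapshot & PHASE_BIT) != (gp & PHASE_BIT); /* started before flip   */
}

static inline void sketch_flip_phase(void)
{
	atomic_fetch_xor_explicit(&gp_ctr, PHASE_BIT, memory_order_seq_cst);
}

My question above is essentially whether tasks-trace RCU has an equivalent
of this phase comparison somewhere that I missed.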