Rather than using the generic preempt_count implementation, x86 uses a
special version that takes advantage of the architecture: a single
instruction can access, add to, decrement, or flip bits of a percpu
counter. This makes preempt_count operations really cheap.

On x86, rcu_preempt_depth can benefit from the same technique.

After the patchset:
 - No function call when using rcu_read_[un]lock().
   This is a minor improvement; other architectures could also
   achieve it by moving ->rcu_read_lock_nesting and
   ->rcu_read_unlock_special into thread_info, but an inlined
   rcu_read_[un]lock() generally generates more instructions and
   a larger footprint on those architectures.
 - Only a single instruction for rcu_read_lock().
 - Only 2 instructions for the fast path of rcu_read_unlock().

Patch 4 simplifies rcu_read_unlock() by avoiding the use of a negative
->rcu_read_lock_nesting, and patch 7 introduces the percpu
rcu_preempt_depth. The other patches are preparation. A sketch of the
percpu-counter technique follows the diffstat below.

Changes from v1:
 - Dropped patches 1 and 2 of v1.
 - Dropped already-merged patches.
 - Avoiding wakeups, done in v1 via special.b.deferred_qs, is now
   done via preempt_count, and special.b.deferred_qs is removed.

Lai Jiangshan (7):
  rcu: use preempt_count to test whether scheduler locks are held
  rcu: cleanup rcu_preempt_deferred_qs()
  rcu: remove useless special.b.deferred_qs
  rcu: don't use negative ->rcu_read_lock_nesting
  rcu: wrap usages of rcu_read_lock_nesting
  rcu: clear the special.b.need_qs in rcu_note_context_switch()
  x86,rcu: use percpu rcu_preempt_depth

 arch/x86/Kconfig                         |   2 +
 arch/x86/include/asm/rcu_preempt_depth.h |  87 +++++++++++++++++++
 arch/x86/kernel/cpu/common.c             |   7 ++
 arch/x86/kernel/process_32.c             |   2 +
 arch/x86/kernel/process_64.c             |   2 +
 include/linux/rcupdate.h                 |  24 ++++++
 include/linux/sched.h                    |   2 +-
 init/init_task.c                         |   2 +-
 kernel/fork.c                            |   2 +-
 kernel/rcu/Kconfig                       |   3 +
 kernel/rcu/tree_exp.h                    |  35 ++------
 kernel/rcu/tree_plugin.h                 | 101 ++++++++++++++---------
 12 files changed, 196 insertions(+), 73 deletions(-)
 create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

-- 
2.20.1
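
For illustration, here is a minimal sketch of the percpu-counter
technique described above, modeled on how x86 implements preempt_count
(arch/x86/include/asm/preempt.h). The __rcu_preempt_depth variable and
the helper names are illustrative only, not necessarily those used in
the patches; raw_cpu_add_4() is the existing percpu primitive that x86
compiles to a single %gs-prefixed instruction.

	/*
	 * Illustrative sketch, not the actual patch: an RCU nesting
	 * depth kept in a percpu variable, mirroring the x86
	 * preempt_count implementation.
	 */
	#include <linux/percpu-defs.h>	/* DECLARE_PER_CPU() */
	#include <linux/compiler.h>	/* __always_inline */

	DECLARE_PER_CPU(int, __rcu_preempt_depth);	/* hypothetical name */

	static __always_inline void rcu_preempt_depth_inc(void)
	{
		/* one instruction on x86: incl %gs:__rcu_preempt_depth */
		raw_cpu_add_4(__rcu_preempt_depth, 1);
	}

	static __always_inline void rcu_preempt_depth_dec(void)
	{
		/*
		 * one instruction on x86: decl %gs:__rcu_preempt_depth;
		 * the rcu_read_unlock() fast path is this decrement plus
		 * a single conditional branch on the result.
		 */
		raw_cpu_add_4(__rcu_preempt_depth, -1);
	}

Because the depth lives in percpu storage rather than in task_struct,
it has to be saved and restored across context switches, which is
presumably why the diffstat touches process_32.c and process_64.c.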