On 2019/11/4 5:25 PM, Sebastian Andrzej Siewior wrote:
> On 2019-11-02 12:45:59 [+0000], Lai Jiangshan wrote:
> > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > is that accessing per-cpu variables is a lot cheaper than accessing
> > task_struct or thread_info variables.
> Is there a benchmark saying how much we gain from this?
Hello,

Maybe I can write a tight loop for testing, but I don't
think anyone would be interested in it.

I'm also trying to find some good real-world tests; I'd
welcome suggestions here.
> > We need to save/restore the actual rcu_preempt_depth on context switch.
> > We also place the per-cpu rcu_preempt_depth close to the __preempt_count
> > and current_task variables.
> > Using the idea of the per-cpu __preempt_count:
> > No function call when using rcu_read_[un]lock().
> > Single instruction for rcu_read_lock().
> > 2 instructions for the fast path of rcu_read_unlock().
> I think these were not inlined due to the header requirements.
objdump -D -S kernel/workqueue.o shows (selected fractions):

        raw_cpu_add_4(__rcu_preempt_depth, 1);
 d8f:   65 ff 05 00 00 00 00    incl   %gs:0x0(%rip)   # d96 <work_busy+0x16>

......

        return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e,
                               __percpu_arg([var]));
 dd8:   65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)   # ddf <work_busy+0x5f>
        if (unlikely(rcu_preempt_depth_dec_and_test()))
 ddf:   74 26                   je     e07 <work_busy+0x87>

......

        rcu_read_unlock_special();
 e07:   e8 00 00 00 00          callq  e0c <work_busy+0x8c>
Boris pointed out one thing: there is also DEFINE_PERCPU_RCU_PREEMP_DEPTH.
Thanks for pointing that out.
Best regards
Lai