On Mon, Jul 31, 2017 at 1:02 AM, Vlastimil Babka <vbabka@xxxxxxx> wrote: > On 07/31/2017 06:01 AM, Dima Zavin wrote: >> In codepaths that use the begin/retry interface for reading >> mems_allowed_seq with irqs disabled, there exists a race condition that >> stalls the patch process after only modifying a subset of the >> static_branch call sites. >> >> This problem manifested itself as a dead lock in the slub >> allocator, inside get_any_partial. The loop reads >> mems_allowed_seq value (via read_mems_allowed_begin), >> performs the defrag operation, and then verifies the consistency >> of mem_allowed via the read_mems_allowed_retry and the cookie >> returned by xxx_begin. The issue here is that both begin and retry >> first check if cpusets are enabled via cpusets_enabled() static branch. >> This branch can be rewritted dynamically (via cpuset_inc) if a new >> cpuset is created. The x86 jump label code fully synchronizes across >> all CPUs for every entry it rewrites. If it rewrites only one of the >> callsites (specifically the one in read_mems_allowed_retry) and then >> waits for the smp_call_function(do_sync_core) to complete while a CPU is >> inside the begin/retry section with IRQs off and the mems_allowed value >> is changed, we can hang. This is because begin() will always return 0 >> (since it wasn't patched yet) while retry() will test the 0 against >> the actual value of the seq counter. >> >> The fix is to use two different static keys: one for begin >> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we >> first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() >> always return a valid seqcount if are enabling cpusets. Similarly, >> when disabling cpusets via cpuset_dec(), we first ensure that callers >> of cpuset_mems_allowed_retry() will start ignoring the seqcount >> value before we let cpuset_mems_allowed_begin() return 0. >> >> The relevant stack traces of the two stuck threads: >> >> CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 >> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 >> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 >> RIP: smp_call_function_many+0x1f9/0x260 >> Call Trace: >> ? setup_data_read+0xa0/0xa0 >> ? ___slab_alloc+0x28b/0x5a0 >> smp_call_function+0x3b/0x70 >> ? setup_data_read+0xa0/0xa0 >> on_each_cpu+0x2f/0x90 >> ? ___slab_alloc+0x28a/0x5a0 >> ? ___slab_alloc+0x28b/0x5a0 >> text_poke_bp+0x87/0xd0 >> ? ___slab_alloc+0x28a/0x5a0 >> arch_jump_label_transform+0x93/0x100 >> __jump_label_update+0x77/0x90 >> jump_label_update+0xaa/0xc0 >> static_key_slow_inc+0x9e/0xb0 >> cpuset_css_online+0x70/0x2e0 >> online_css+0x2c/0xa0 >> cgroup_apply_control_enable+0x27f/0x3d0 >> cgroup_mkdir+0x2b7/0x420 >> kernfs_iop_mkdir+0x5a/0x80 >> vfs_mkdir+0xf6/0x1a0 >> SyS_mkdir+0xb7/0xe0 >> entry_SYSCALL_64_fastpath+0x18/0xad >> >> ... >> >> CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 >> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 >> task: ffff8818087c0000 task.stack: ffffc90000030000 >> RIP: int3+0x39/0x70 >> Call Trace: >> <#DB> ? ___slab_alloc+0x28b/0x5a0 >> <EOE> ? copy_process.part.40+0xf7/0x1de0 >> ? __slab_alloc.isra.80+0x54/0x90 >> ? copy_process.part.40+0xf7/0x1de0 >> ? copy_process.part.40+0xf7/0x1de0 >> ? kmem_cache_alloc_node+0x8a/0x280 >> ? copy_process.part.40+0xf7/0x1de0 >> ? _do_fork+0xe7/0x6c0 >> ? _raw_spin_unlock_irq+0x2d/0x60 >> ? trace_hardirqs_on_caller+0x136/0x1d0 >> ? entry_SYSCALL_64_fastpath+0x5/0xad >> ? do_syscall_64+0x27/0x350 >> ? SyS_clone+0x19/0x20 >> ? do_syscall_64+0x60/0x350 >> ? entry_SYSCALL64_slow_path+0x25/0x25 >> >> Reported-by: Cliff Spradlin <cspradlin@xxxxxxxxx> >> Signed-off-by: Dima Zavin <dmitriyz@xxxxxxxxx> > > Looks good. Could you verify it fixes the issue, or was it too hard to > reproduce? Also is this a stable candidate patch, and can you identify > an exact commit hash it fixes? It's tough to reproduce but this fix works as well as the original hack. I think the problematic commit must have been: commit 46e700abc44ce215acb4341d9702ce3972eda571 Author: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> Date: Fri Nov 6 16:28:15 2015 -0800 mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled It's probably stable worthy, but I don't know who can make the call on that. Thanks for the reviews! --Dima > > Acked-by: Vlastimil Babka <vbabka@xxxxxxx> > >> --- >> >> v3: >> - Changed the implementation based on Peter Zijlstra's suggestion. Now >> using two keys for begin/retry instead of hacking the state into the >> cookie. >> - Rebased and tested on top of v4.13-rc3. >> >> v4: >> - Moved the cached cpusets_enabled() state into the cookie, turned >> the cookie into a struct and updated all the other call sites. >> - Applied on top of v4.12 since one of the callers in page_alloc.c changed. >> Still only tested on v4.9.36 and compile tested against v4.12. >> >> include/linux/cpuset.h | 19 +++++++++++++++++-- >> kernel/cgroup/cpuset.c | 1 + >> 2 files changed, 18 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h >> index 119a3f9604b0..e5a684c04c70 100644 >> --- a/include/linux/cpuset.h >> +++ b/include/linux/cpuset.h >> @@ -18,6 +18,19 @@ >> >> #ifdef CONFIG_CPUSETS >> >> +/* >> + * Static branch rewrites can happen in an arbitrary order for a given >> + * key. In code paths where we need to loop with read_mems_allowed_begin() and >> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need >> + * to ensure that begin() always gets rewritten before retry() in the >> + * disabled -> enabled transition. If not, then if local irqs are disabled >> + * around the loop, we can deadlock since retry() would always be >> + * comparing the latest value of the mems_allowed seqcount against 0 as >> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled >> + * transition should happen in reverse order for the same reasons (want to stop >> + * looking at real value of mems_allowed.sequence in retry() first). >> + */ >> +extern struct static_key_false cpusets_pre_enable_key; >> extern struct static_key_false cpusets_enabled_key; >> static inline bool cpusets_enabled(void) >> { >> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void) >> >> static inline void cpuset_inc(void) >> { >> + static_branch_inc(&cpusets_pre_enable_key); >> static_branch_inc(&cpusets_enabled_key); >> } >> >> static inline void cpuset_dec(void) >> { >> static_branch_dec(&cpusets_enabled_key); >> + static_branch_dec(&cpusets_pre_enable_key); >> } >> >> extern int cpuset_init(void); >> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void); >> */ >> static inline unsigned int read_mems_allowed_begin(void) >> { >> - if (!cpusets_enabled()) >> + if (!static_branch_unlikely(&cpusets_pre_enable_key)) >> return 0; >> >> return read_seqcount_begin(¤t->mems_allowed_seq); >> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void) >> */ >> static inline bool read_mems_allowed_retry(unsigned int seq) >> { >> - if (!cpusets_enabled()) >> + if (!static_branch_unlikely(&cpusets_enabled_key)) >> return false; >> >> return read_seqcount_retry(¤t->mems_allowed_seq, seq); >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c >> index ca8376e5008c..8d5151688504 100644 >> --- a/kernel/cgroup/cpuset.c >> +++ b/kernel/cgroup/cpuset.c >> @@ -63,6 +63,7 @@ >> #include <linux/cgroup.h> >> #include <linux/wait.h> >> >> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); >> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); >> >> /* See "Frequency meter" comments, below. */ >> > -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html