On 20/03/24 15:03, Daniel Vacek wrote: > Hi Valentin, > > On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote: >> On 18/03/24 12:17, Daniel Vacek wrote: >> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390. >> > >> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()` >> > can fail when racing with off-/on-lining resulting in a deadlock waiting for >> > the pending migration stop work completion which is never done. >> > >> > Fix this by correctly handling such a condition. >> > >> >> IIUC the problem is that the dest_cpu and its stopper thread can be taken >> down by take_cpu_down(), and affine_move_task() currently isn't aware of >> that. I thought we had tested this vs hotplug, but oh well... > > I'm sorry, I should have provided more context in the first place. The machine > is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces > show this scenario: > > CPU 0 | CPU 1 > | > cpuplugd task 1429 | > holds the `cpu_hotplug_lock` | > for writing in _cpu_up+0x16a | > blocked on `cpuhp_state:1.done_up` | > completion in __cpuhp_kick_ap+0x76 | > | > | cpuhp/1 task 17 > |supposed to complete bringup of the CPU > | (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108 > |blocked on `wq_pool_attach_mutex` > | in workqueue_online_cpu+0x9e > | > xfs-conv/dm-0 task 745 | > holds the `wq_pool_attach_mutex` | > in worker_attach_to_pool+0x66 \ > blocked on `task->migration_pending->done`| > completion in affine_move_task+0x10a/ > > ~~~ > crash> b 1429 > PID: 1429 TASK: 99398000 CPU: 0 COMMAND: "cpuplugd" > #0 [997df970] __schedule+0x34c at 3089c424 > #1 [997df9e0] schedule+0x7e at 3089cafe > #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e > [inlined] do_wait_for_common > [inlined] __wait_for_common > #3 [997dfad8] wait_for_common+0x14a at 3089d902 > [ret call] wait_for_completion+0x1a at 3089d96a > > [inlined] wait_for_ap_thread <<< blocked on `cpuhp_state:1.done_up` completion > [ret call] __cpuhp_kick_ap+0x76 at 300c610e > #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc > [inlined] bringup_wait_for_ap > [ret call] bringup_cpu+0xea at 300c6402 > #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14 > #6 [997dfc40] _cpu_up+0x16a at 300c798a <<< holds the `cpu_hotplug_lock` for writing > #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66 > #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00 > #9 [997dfd28] device_online+0x9e at 30598e7e > #10 [997dfd68] online_store+0x88 at 30598f28 > #11 [997dfda8] kernfs_fop_write+0xdc at 3040276c > #12 [997dfdf8] vfs_write+0xa8 at 30354760 > #13 [997dfe58] ksys_write+0x62 at 30354a32 > > crash> cpuhp_cpu_state.state cpuhp_state:a | paste - - > [0]: 1aef424e0 state = CPUHP_ONLINE, # (195) > [1]: 1aef654e0 state = CPUHP_AP_WORKQUEUE_ONLINE, # (159) > > crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - - > [1]: 1aef654e0 bringup = true, thread = 0x81134400, done_up.done = 0, <<< > > crash> b 17 > PID: 17 TASK: 81134400 CPU: 1 COMMAND: "cpuhp/1" > #0 [81143b68] __schedule+0x34c at 3089c424 > #1 [81143bd8] schedule+0x7e at 3089cafe > #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba > #3 [81143c30] __mutex_lock+0x320 at 3089df60 > > #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e <<< blocked on `wq_pool_attach_mutex` > #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14 > #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848 <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`) > > crash> b 745 > PID: 745 TASK: 82359100 CPU: 0 COMMAND: "xfs-conv/dm-0" > #0 [8b4bfa20] __schedule+0x34c at 3089c424 > #1 [8b4bfa90] schedule+0x7e at 3089cafe > #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e > [inlined] do_wait_for_common > [inlined] __wait_for_common > #3 [8b4bfb88] wait_for_common+0x14a at 3089d902 > [ret call] wait_for_completion+0x1a at 3089d96a > > #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a <<< blocked on `task->migration_pending->done` completion > #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926 > [ret call] set_cpus_allowed_ptr+0xa at 300fba32 > #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae <<< holds the `wq_pool_attach_mutex` > #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac > > crash> rx 8b4bfea0 > 8b4bfea0: [863373c0:kmalloc-192] > > crash> worker.task,rescue_wq 863373c0 > task = 0x82359100, > rescue_wq = 0x8aa44400, > > crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - - > 1fffff7f751900 pool = 0x1aef56a00, > > crash> worker_pool.attrs 0x1aef56a00 > attrs = 0x80088180, > > crash> workqueue_attrs.cpumask[0].bits 0x80088180 > cpumask[0].bits = {0x1, 0x0, ... > > crash> cpumask.bits __cpu_active_mask > bits = {0x1, 0x0, ... > > crash> cpumask.bits __cpu_online_mask > bits = {0x3, 0x0, ... > > crash> task_struct.migration_pending,flags 0x82359100 > migration_pending = 0x8b4bfce8, > flags = 0x4208060, > ^ PF_KTHREAD > > crash> pd distribute_cpu_mask_prev:0 > per_cpu(distribute_cpu_mask_prev, 0) = 0 > > crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d > refs.refs.counter = 1 > arg = { > task = 0x82359100, > dest_cpu = 0, > pending = 0x8b4bfce8 > } > stop_pending = 1, > done.done = 0, > ~~~ > > In other words the `set_cpus_allowed_ptr()` is called from a worker thread which > tries to migrate. The worker pool is only allowed on CPU 0 and that was supposed > to be the destination as per the stack structure. In this case I thought it's OK > to leave the task on the old CPU AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task() and down to the stopper call, that means the task isn't allowed on its current CPU and needs to be moved. > and the Bill's testing scenario was successful > with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing. > > This was on RHEL 8.8.z kernel. I see upstream changed a bit so I'm not sure it's > still reproducible. Also, I'm not sure why this only happens on s390 and not on > x86. I imagine the CPU hotplug slightly differs? Anyways this seems to be timing > sensitive and the timing will differ greatly for sure. > Thanks for the extra context! Double checking what I wrote before, I forgot RCU considers preempt-off sections as read-side critical sections. __set_cpus_allowed_ptr() already has preemption disabled all the way from reading the cpu_active_mask to the stop_one_cpu_nowait() call via task_rq_lock() + preempt_disable(). IOW we have: __set_cpus_allowed_ptr() task_rq_lock() <-- PREEMPT OFF __set_cpus_allowed_ptr_locked() cpu_valid_mask = cpu_active_mask; dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask); affine_move_task() preempt_disable(); task_rq_unlock(); stop_one_cpu_nowait(); <-- preemption still OFF And, considering: sched_cpu_deactivate() set_cpu_active(cpu, false); synchronize_rcu(); Then, if __set_cpus_allowed_ptr() observes a CPU as being in the cpu_active_mask and uses that one as a destination CPU, said CPU cannot reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread because its hotplug machinery will wait on the synchronize_rcu() in CPUHP_AP_ACTIVE:sched_cpu_deactivate(). So "in theory", this shouldn't happen upstream.