On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> Hello,
>
> I observe that running stress-ng with the cpu-online and fstat tests
> results in a deadlock of hung tasks:
>
> [ 366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than
> 120 seconds.
> [ 366.817689] Not tainted 4.9.0 #39
> [ 366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 366.829320] stress-ng-cpu-o D 0 2590 2589 0x00000008
> [ 366.834803] Call trace:
> [ 366.837222] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [ 366.842338] [<ffff000008a23c18>] __schedule+0x178/0x648
> [ 366.847550] [<ffff000008a24120>] schedule+0x38/0x98
> [ 366.852408] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 366.858749] [<ffff00000848e9d4>] blk_mq_queue_reinit_work+0x74/0x110
> [ 366.865081] [<ffff00000848ea94>] blk_mq_queue_reinit_dead+0x24/0x30
> [ 366.871335] [<ffff0000080c9898>] cpuhp_invoke_callback+0x98/0x4a8
> [ 366.877411] [<ffff0000080cb084>] cpuhp_down_callbacks+0x114/0x150
> [ 366.883484] [<ffff000008a22578>] _cpu_down+0x100/0x1d8
> [ 366.888609] [<ffff0000080cbfdc>] do_cpu_down+0x4c/0x78
> [ 366.893727] [<ffff0000080cc02c>] cpu_down+0x24/0x30
> [ 366.898593] [<ffff0000086aaf28>] cpu_subsys_offline+0x20/0x30
> [ 366.904318] [<ffff0000086a53d8>] device_offline+0xa8/0xd8
> [ 366.909704] [<ffff0000086a550c>] online_store+0x4c/0xa8
> [ 366.914907] [<ffff0000086a241c>] dev_attr_store+0x44/0x60
> [ 366.920294] [<ffff0000082b6a24>] sysfs_kf_write+0x5c/0x78
> [ 366.925672] [<ffff0000082b5cec>] kernfs_fop_write+0xbc/0x1e8
> [ 366.931318] [<ffff000008238320>] __vfs_write+0x48/0x138
> [ 366.936526] [<ffff000008239078>] vfs_write+0xa8/0x1c0
> [ 366.941557] [<ffff00000823a08c>] SyS_write+0x54/0xb0
> [ 366.946511] [<ffff000008083370>] el0_svc_naked+0x24/0x28
> [ 366.951800] INFO: task stress-ng-fstat:2591 blocked for more than
> 120 seconds.
> [ 366.959008] Not tainted 4.9.0 #39
> [ 366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 366.970640] stress-ng-fstat D 0 2591 2589 0x00000000
> [ 366.976105] Call trace:
> [ 366.978540] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [ 366.983658] [<ffff000008a23c18>] __schedule+0x178/0x648
> [ 366.988870] [<ffff000008a24120>] schedule+0x38/0x98
> [ 366.993727] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 367.000068] [<ffff00000848e2d0>] blk_mq_freeze_queue+0x28/0x38
> [ 367.005880] [<ffff0000086d480c>] lo_release+0x64/0x90
> [ 367.010919] [<ffff000008278bd4>] __blkdev_put+0x26c/0x2c8
> [ 367.016300] [<ffff000008278fec>] blkdev_put+0x54/0x128
> [ 367.021418] [<ffff0000082790ec>] blkdev_close+0x2c/0x40
> [ 367.026631] [<ffff00000823ab58>] __fput+0xa0/0x1e0
> [ 367.031401] [<ffff00000823ad10>] ____fput+0x20/0x30
> [ 367.036266] [<ffff0000080e7a40>] task_work_run+0xc8/0xe8
> [ 367.041557] [<ffff0000080882b4>] do_notify_resume+0xac/0xb8
> [ 367.047116] [<ffff000008083294>] work_pending+0x8/0x10
>
> I have tested and found this issue to be reproducible on both x86
> and ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3
> kernels.
>
> Using the below test methodology [1], the issue reproduces within a
> few minutes.
>
> Using ftrace, I have analyzed the issue on 4.9 and I believe I've
> found the root cause [2].
>
> Based on my analysis, I have developed a fix [3], which addresses
> the issue as I am able to run stress-ng for over an hour where I was
> unable to do so before; however I do not know the full extent of
> impacts from this fix, and look for guidance from the community to
> determine the final fix.
>
>
> [1] Test methodology
> --------------------
> Boot a multicore system such as a desktop i5 system with nr_cpus=2
>
> Enable all logging to determine when the deadlock occurs (prints
> from the test stop flowing out of the serial port)
> echo 1 > /sys/module/printk/parameters/ignore_loglevel
>
> Run stress-ng
> stress-ng --fstat 1 --cpu-online 1 -t 3600
>
> Wait for the test output to stop, and the hung task watchdog to fire.
>
>
> [2] Analysis
> ------------
> Again, this analysis is based upon the 4.9 kernel, but I believe it to
> still apply to newer kernels.
>
> I conclude that the hung tasks occur due to a race condition which
> results in a deadlock.
>
> The race condition occurs between "normal" work in the block layer
> on a core (the stress-ng-fstat task in the above dump) and cpu
> offline of that core (the stress-ng-cpu-o task in the above dump).
>
> The fput() from userspace in the fstat task results in a call to
> blk_mq_freeze_queue(), which drops the last reference to the queue
> via percpu_ref_kill(), and then waits for the ref count of the queue
> to hit 0 in blk_mq_freeze_queue_wait().  percpu_ref_kill() will
> result in __percpu_ref_switch_to_atomic(), which will use
> call_rcu_sched() to set up delayed work to finalize the percpu_ref
> cleanup and drop the ref count to 0.
>
> Note that call_rcu_sched() queues the work to a per-cpu queue, thus
> the work can only be run on the core it is queued on, by the work
> thread that is pinned to that core.
>
> It is a race between this work running, and the cpu offline processing.

One quick way to test this assumption is to build a kernel with Kconfig
options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.  This will
cause call_rcu_sched() to queue the work to a kthread, which can migrate
to some other CPU.  If your analysis is correct, this should avoid the
deadlock.  (Note that the deadlock should be fixed in any case; this is
just a diagnostic check of the assumption.)

> If the cpu offline processing is able to get to and process the
> RCU/tree:online state before the queued work from the block layer,
> then the pinned work thread will be migrated to another core via
> rcutree_offline_cpu(), and the work will not be able to execute.
>
> This race condition does not result in deadlock until later in the
> cpu offline processing.  Once we hit the block/mq:prepare state, the
> block layer freezes all the queues and waits for the ref counts to
> hit 0.  This normally works because at this point the cpu being
> offlined is dead from cpu:teardown, and the offline processing is
> occurring on another active cpu, so call_rcu_sched() will queue work
> to an active cpu where it can get processed.  However, the fstat
> process already did that work for one of the queues to be frozen in
> the block layer, so the processing of the block/mq:prepare state
> waits on the same ref count as fstat to hit 0.  Thus we see the
> result of this as the stress-ng-cpu-o task above.
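
To make the refcount hand-off described above concrete, the pattern at
the heart of blk_mq_freeze_queue()/blk_mq_freeze_queue_wait() is roughly
the following.  This is a simplified sketch only, not the actual
block-layer code; demo_ref, demo_frozen and demo_ref_release are made-up
names used purely for illustration.

/*
 * Simplified sketch of the kill-and-wait pattern discussed above.
 * percpu_ref_kill() marks the ref dead and initiates the switch to
 * atomic mode; that switch is completed by a callback queued with
 * call_rcu_sched() -- a per-CPU callback list -- which is the pinning
 * this analysis hinges on.
 */
#include <linux/percpu-refcount.h>
#include <linux/completion.h>
#include <linux/module.h>

static struct percpu_ref demo_ref;
static DECLARE_COMPLETION(demo_frozen);

/* Invoked once the reference count has dropped to zero. */
static void demo_ref_release(struct percpu_ref *ref)
{
	complete(&demo_frozen);
}

static int __init demo_init(void)
{
	int ret;

	ret = percpu_ref_init(&demo_ref, demo_ref_release, 0, GFP_KERNEL);
	if (ret)
		return ret;

	/*
	 * Analogous to freezing a queue: drop the initial reference and
	 * sleep until the release callback fires.  If the RCU callback
	 * that finishes the switch to atomic mode never runs (because it
	 * is stranded on a CPU whose callbacks are not migrated), this
	 * wait never returns -- the hang seen in the traces above.
	 */
	percpu_ref_kill(&demo_ref);
	wait_for_completion(&demo_frozen);

	percpu_ref_exit(&demo_ref);
	return 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");

The waiter is only woken after the call_rcu_sched() callback has run on
the CPU it was queued on, and that is exactly the callback which gets
stranded in the race described above.
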
>
> The block/mq:prepare processing stalls the cpu offline processing,
> which causes a deadlock because the processing does not get to the
> RCU/tree:prepare state which migrates all of the queued work from
> the offline cpu to another cpu, which would allow the work that the
> fstat task queued to execute, drop the ref count to 0, and unblock
> both stalled tasks.
>
> By reordering the cpu offline states such that the shutdown processing
> of RCU/tree:prepare occurs before block/mq:prepare [3], we prevent
> deadlock by enabling the queued work in the RCU framework to run
> elsewhere, and eventually unblock the tasks waiting on the ref
> count.
>
> However, it is not entirely clear what the full ramifications of
> this reorder are.  I assume the ordering of these cpu online/offline
> states is carefully considered, and without that knowledge, I could
> not say for certain that my fix [3] is safe.
>
> What is the opinion of the domain experts?

I do hope that we can come up with a better fix.  No offense intended,
as coming up with -any- fix in the CPU-hotplug domain is not to be
denigrated, but this looks to be at best quite fragile.

							Thanx, Paul

> --
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
>
>
> [3] Proposed fix
> ---8>---
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index afe641c..9b86db9 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -49,6 +49,7 @@ enum cpuhp_state {
>  	CPUHP_ARM_SHMOBILE_SCU_PREPARE,
>  	CPUHP_SH_SH3X_PREPARE,
>  	CPUHP_BLK_MQ_PREPARE,
> +	CPUHP_RCUTREE_PREP2,
>  	CPUHP_TIMERS_DEAD,
>  	CPUHP_NOTF_ERR_INJ_PREPARE,
>  	CPUHP_MIPS_SOC_PREPARE,
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 29de1a9..b46c573 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void)
>  	[CPUHP_RCUTREE_PREP] = {
>  		.name			= "RCU/tree:prepare",
>  		.startup.single		= rcutree_prepare_cpu,
> +		.teardown.single	= NULL,
> +	},
> +	[CPUHP_RCUTREE_PREP2] = {
> +		.name			= "RCU/tree:dead",
> +		.startup.single		= NULL,
>  		.teardown.single	= rcutree_dead_cpu,
>  	},
>  	/*
>
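
For reference, the ordering rule the proposed reordering relies on is
that cpuhp states run their startup callbacks in increasing state order
as a CPU comes online, and their teardown callbacks in decreasing state
order as it goes offline; placing CPUHP_RCUTREE_PREP2 after
CPUHP_BLK_MQ_PREPARE therefore makes rcutree_dead_cpu() run before
blk_mq_queue_reinit_dead() on the offline path.  Below is a minimal
sketch of how a startup/teardown pair is hooked into this machinery: a
hypothetical "demo" consumer using the dynamic state range, not part of
the proposed patch, with made-up names throughout.

/*
 * Illustration only: a hypothetical hotplug consumer, not part of the
 * proposed patch.  The same startup/teardown pairing appears in the
 * state table in kernel/cpu.c that the patch above edits; there the
 * ordering comes from the fixed enum positions rather than from a
 * dynamically allocated slot.
 */
#include <linux/cpuhotplug.h>
#include <linux/module.h>
#include <linux/printk.h>

static int demo_hp_state;

/* Startup callback: runs when a CPU comes online. */
static int demo_cpu_online(unsigned int cpu)
{
	pr_info("demo: cpu %u coming online\n", cpu);
	return 0;
}

/* Teardown callback: runs when a CPU goes offline. */
static int demo_cpu_offline(unsigned int cpu)
{
	pr_info("demo: cpu %u going offline\n", cpu);
	return 0;
}

static int __init demo_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "demo:online",
				demo_cpu_online, demo_cpu_offline);
	if (ret < 0)
		return ret;
	demo_hp_state = ret;	/* dynamically allocated state number */
	return 0;
}

static void __exit demo_exit(void)
{
	cpuhp_remove_state(demo_hp_state);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");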