Re: Sleeping while atomic regression around hd_struct_free() in 5.9-rc

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 28 Aug 2020 21:05:14 +0800

Hi Ilya,

Thanks for your report!

On Fri, Aug 28, 2020 at 12:32:48PM +0200, Ilya Dryomov wrote:
> Hi Ming,
> 
> There seems to be a sleeping while atomic bug in hd_struct_free():
> 
> 288 static void hd_struct_free(struct percpu_ref *ref)
> 289 {
> 290         struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> 291         struct gendisk *disk = part_to_disk(part);
> 292         struct disk_part_tbl *ptbl =
> 293                 rcu_dereference_protected(disk->part_tbl, 1);
> 294
> 295         rcu_assign_pointer(ptbl->last_lookup, NULL);
> 296         put_device(disk_to_dev(disk));
> 297
> 298         INIT_RCU_WORK(&part->rcu_work, hd_struct_free_work);
> 299         queue_rcu_work(system_wq, &part->rcu_work);
> 300 }
> 
> hd_struct_free() is a percpu_ref release callback and must not sleep.
> But put_device() can end up in disk_release(), resulting in anything
> from "sleeping function called from invalid context" splats to actual
> lockups if the queue ends up being released:
> 
>   BUG: scheduling while atomic: ksoftirqd/3/26/0x00000102
>   INFO: lockdep is turned off.
>   CPU: 3 PID: 26 Comm: ksoftirqd/3 Tainted: G        W
> 5.9.0-rc2-ceph-g2de49bea2ebc #1
>   Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>   Call Trace:
>    dump_stack+0x96/0xd0
>    __schedule_bug.cold+0x64/0x71
>    __schedule+0x8ea/0xac0
>    ? wait_for_completion+0x86/0x110
>    schedule+0x5f/0xd0
>    schedule_timeout+0x212/0x2a0
>    ? wait_for_completion+0x86/0x110
>    wait_for_completion+0xb0/0x110
>    __wait_rcu_gp+0x139/0x150
>    synchronize_rcu+0x79/0xf0
>    ? invoke_rcu_core+0xb0/0xb0
>    ? rcu_read_lock_any_held+0xb0/0xb0
>    blk_free_flush_queue+0x17/0x30
>    blk_mq_hw_sysfs_release+0x32/0x70
>    kobject_put+0x7d/0x1d0
>    blk_mq_release+0xbe/0xf0
>    blk_release_queue+0xb7/0x140
>    kobject_put+0x7d/0x1d0
>    disk_release+0xb0/0xc0
>    device_release+0x25/0x80
>    kobject_put+0x7d/0x1d0
>    hd_struct_free+0x4c/0xc0
>    percpu_ref_switch_to_atomic_rcu+0x1df/0x1f0
>    rcu_core+0x3fd/0x660
>    ? rcu_core+0x3cc/0x660
>    __do_softirq+0xd5/0x45e
>    ? smpboot_thread_fn+0x26/0x1d0
>    run_ksoftirqd+0x30/0x60
>    smpboot_thread_fn+0xfe/0x1d0
>    ? sort_range+0x20/0x20
>    kthread+0x11a/0x150
>    ? kthread_delayed_work_timer_fn+0xa0/0xa0
>    ret_from_fork+0x1f/0x30
> 
> "git blame" points at your commit tb7d6c3033323 ("block: fix
> use-after-free on cached last_lookup partition"), but there is
> likely more to it because it went into 5.8 and I haven't seen
> these lockups until we rebased to 5.9-rc.

The pull-the-trigger commit is actually e8c7d14ac6c3 ("block: revert back to
synchronous request_queue removal").

> 
> Could you please take a look?

Can you try the following patch?

diff --git a/block/partitions/core.c b/block/partitions/core.c
index e62a98a8eeb7..b06fc3425802 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -278,6 +278,15 @@ static void hd_struct_free_work(struct work_struct *work)
 {
 	struct hd_struct *part =
 		container_of(to_rcu_work(work), struct hd_struct, rcu_work);
+	struct gendisk *disk = part_to_disk(part);
+
+	/*
+	 * Release the reference grabbed in delete_partition, and it should
+	 * have been done in hd_struct_free(), however device's release
+	 * handler can't be done in percpu_ref's ->release() callback
+	 * because it is run via call_rcu().
+	 */
+	put_device(disk_to_dev(disk));
 
 	part->start_sect = 0;
 	part->nr_sects = 0;
@@ -293,7 +302,6 @@ static void hd_struct_free(struct percpu_ref *ref)
 		rcu_dereference_protected(disk->part_tbl, 1);
 
 	rcu_assign_pointer(ptbl->last_lookup, NULL);
-	put_device(disk_to_dev(disk));
 
 	INIT_RCU_WORK(&part->rcu_work, hd_struct_free_work);
 	queue_rcu_work(system_wq, &part->rcu_work);



Thanks,
Ming