On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
> Thanks Neil,
>
> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>
>>> As you suggested, I re-ran the same test on 4.4.36 without any of our own MD patches.
>>> I can still reproduce the same bug; nr_pending on the healthy leg (loop1) is still 1.
>>
>> Thanks.
>>
>> I have an hypothesis.
>>
>> md_make_request() calls blk_queue_split().
>> If that does split the request, it will call generic_make_request()
>> on the first half.  That will call back into md_make_request() and
>> raid1_make_request(), which will submit requests to the underlying
>> devices.  These will get caught on the bio_list_on_stack queue in
>> generic_make_request().
>> This is a queue which is not accounted for in nr_queued.
>>
>> When blk_queue_split() completes, 'bio' will be the second half of
>> the original bio.
>> This enters raid1_make_request(), and by this time the array has been
>> frozen.
>> So wait_barrier() has to wait for pending requests to complete, and that
>> includes the one that is stuck in bio_list_on_stack, which will never
>> complete now.
>>
>> To see if this might be happening, please change the
>>
>>     blk_queue_split(q, &bio, q->bio_split);
>>
>> call in md_make_request() to
>>
>>     struct bio *tmp = bio;
>>     blk_queue_split(q, &bio, q->bio_split);
>>     WARN_ON_ONCE(bio != tmp);
>>
>> If that ever triggers, then the above is a real possibility.
>
> I triggered the warning as you expected, so we can confirm the bug is
> caused by your hypothesis above:
>
> [ 429.282235] ------------[ cut here ]------------
> [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]()
> [ 429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm_infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmouse xhci_pci xhci_hcd [last unloaded: mlx_compat]
> [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pserver #1
> [ 429.288825] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
> [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000
> [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000
> [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700
> [ 429.290594] Call Trace:
> [ 429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
> [ 429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
> [ 429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
> [ 429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
> [ 429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
> [ 429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
> [ 429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
> [ 429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
> [ 429.292011]  [<ffffffffa0834f64>] ? ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
> [ 429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
> [ 429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
> [ 429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
> [ 429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
> [ 429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
> [ 429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
> [ 429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
> [ 429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
> [ 429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
> [ 429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
> [ 429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
> [ 429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
> [ 429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [ 429.294610] ---[ end trace 25d1cece0e01494b ]---
>
> I double checked: nr_pending on the healthy leg is still 1, as before.
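
(Side note, to make the trap easier to see: below is my rough paraphrase of
generic_make_request() as it looks in 4.4, simplified and from memory, so the
details may be off.  Once a make_request_fn is active on a task, any bio
submitted recursively is only parked on the on-stack list and is not dispatched
until the outer ->make_request_fn returns:

    blk_qc_t generic_make_request(struct bio *bio)
    {
    	struct bio_list bio_list_on_stack;
    	blk_qc_t ret = BLK_QC_T_NONE;

    	if (current->bio_list) {
    		/* A make_request_fn is already running on this task:
    		 * just park the bio; the loop below only picks it up
    		 * after the currently running ->make_request_fn returns. */
    		bio_list_add(current->bio_list, bio);
    		return ret;
    	}

    	bio_list_init(&bio_list_on_stack);
    	current->bio_list = &bio_list_on_stack;
    	do {
    		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

    		ret = q->make_request_fn(q, bio);	/* e.g. md_make_request() */

    		bio = bio_list_pop(current->bio_list);
    	} while (bio);
    	current->bio_list = NULL;

    	return ret;
    }

So the bios that raid1_make_request() submitted to the member disks for the
first half of the split are parked on bio_list_on_stack; they cannot be
dispatched, and nr_pending cannot drop, until the outer md_make_request()
returns — which it never does once wait_barrier() blocks for the second half
on the frozen array.)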
>>
>> Fixing the problem isn't very easy...
>>
>> You could try:
>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>>    (which you will need to export from block/bio.c),
>>    passing mddev->queue->bio_split as the bio_set.
>>
>> 2/ change the wait_event_lock_irq() call in wait_barrier() to
>>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>>
>> That way, if wait_barrier() ever blocks, all the requests in
>> bio_list_on_stack will be handled by a separate thread.
>>
>> NeilBrown
>
> I will try your suggested way to see if it fixes the bug; I will report back soon.

Hi Neil,

Sorry, bad news: with the two patches attached, I can still reproduce the
same bug.  nr_pending on the healthy leg is still 1, as before.

crash> struct r1conf 0xffff8800b7176100
struct r1conf {
  mddev = 0xffff8800b59b0000,
  mirrors = 0xffff88022bab7900,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  retry_list = { next = 0xffff880211b2ec40, prev = 0xffff88022819ad40 },
  bio_end_io_list = { next = 0xffff880227e9a9c0, prev = 0xffff8802119c6140 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff8800adf3b818, prev = 0xffff88021180f7a8 }
  },
  resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  nr_pending = 1675,
  nr_waiting = 100,
  nr_queued = 1673,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022c80f640,
  r1bio_pool = 0xffff88022b8b6a20,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008a90c80,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

  kobj = {
    name = 0xffff88022b7194a0 "dev-loop1",
    entry = { next = 0xffff880231495280, prev = 0xffff880231495280 },
    parent = 0xffff8800b59b0050,
    kset = 0x0,
    ktype = 0xffffffffa0564060 <rdev_ktype>,
    sd = 0xffff8800b6510960,
    kref = { refcount = { counter = 1 } },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff8802314952c8, prev = 0xffff8802314952c8 }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 0, journal_tail = 0 },
  nr_pending = { counter = 1 },
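
The second block above I take to be from the struct md_rdev of dev-loop1 (the
healthy leg), where nr_pending.counter is still 1.  If I read the 4.4
freeze_array() correctly (quoted from memory, so the exact wording may differ),
it can only finish once every pending request is either completed or queued
for retry:

    static void freeze_array(struct r1conf *conf, int extra)
    {
    	spin_lock_irq(&conf->resync_lock);
    	conf->array_frozen = 1;
    	/* wait until every pending request is accounted for as queued */
    	wait_event_lock_irq_cmd(conf->wait_barrier,
    				conf->nr_pending == conf->nr_queued + extra,
    				conf->resync_lock,
    				flush_pending_writes(conf));
    	spin_unlock_irq(&conf->resync_lock);
    }

With nr_pending = 1675 and nr_queued = 1673, and assuming extra == 1 here (as
handle_read_error() passes), the condition needs nr_pending to reach 1674, so
exactly one request is neither completing nor being queued for retry — which
matches a bio parked on bio_list_on_stack.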

--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@xxxxxxxxxxxxxxxx
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 11:26:23 +0100
Subject: [PATCH 1/2] block: export punt_bios_to_rescuer

We need it later in raid1.

Suggested-by: Neil Brown <neil.brown@xxxxxxxx>
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
---
 block/bio.c         | 3 ++-
 include/linux/bio.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 46e2cc1..f6a250d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+void punt_bios_to_rescuer(struct bio_set *bs)
 {
 	struct bio_list punt, nopunt;
 	struct bio *bio;
@@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 
 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }
+EXPORT_SYMBOL(punt_bios_to_rescuer);
 
 /**
  * bio_alloc_bioset - allocate a bio for I/O
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 42e4e3c..6256ba7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned);
 extern void bio_init(struct bio *);
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
+void punt_bios_to_rescuer(struct bio_set *);
 
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
--
2.7.4
From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 11:35:52 +0100
Subject: [PATCH 2/2] raid1: fix deadlock

Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
---
 drivers/md/raid1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 478223c..61dafb1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio)
 	}
 }
 
+static void raid1_punt_bios_to_rescuer(struct r1conf *conf)
+{
+	punt_bios_to_rescuer(conf->mddev->queue->bio_split);
+}
+
 static void free_r1bio(struct r1bio *r1_bio)
 {
 	struct r1conf *conf = r1_bio->mddev->private;
@@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 		 * that queue to allow conf->start_next_window
 		 * to increase.
 		 */
-		wait_event_lock_irq(conf->wait_barrier,
+		wait_event_lock_irq_cmd(conf->wait_barrier,
 				    !conf->array_frozen &&
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
 				      current->bio_list &&
 				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
+				    conf->resync_lock,
+				    raid1_punt_bios_to_rescuer(conf));
 		conf->nr_waiting--;
 	}
 
--
2.7.4