Thanks Neil,

On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>
>> As you suggested, I re-ran the same test on 4.4.36 without any of our
>> own patches on MD. I can still reproduce the same bug; nr_pending on
>> the healthy leg (loop1) is still 1.
>>
>
> Thanks.
>
> I have a hypothesis.
>
> md_make_request() calls blk_queue_split().
> If that does split the request it will call generic_make_request()
> on the first half. That will call back into md_make_request() and
> raid1_make_request(), which will submit requests to the underlying
> devices. These will get caught on the bio_list_on_stack queue in
> generic_make_request().
> This is a queue which is not accounted for in nr_queued.
>
> When blk_queue_split() completes, 'bio' will be the second half of
> the original bio.
> This enters raid1_make_request(), and by this time the array has
> been frozen.
> So wait_barrier() has to wait for pending requests to complete, and
> that includes the one that is stuck on bio_list_on_stack, which can
> never complete now.
>
> To see if this might be happening, please change the
>
>     blk_queue_split(q, &bio, q->bio_split);
>
> call in md_make_request() to
>
>     struct bio *tmp = bio;
>     blk_queue_split(q, &bio, q->bio_split);
>     WARN_ON_ONCE(bio != tmp);
>
> If that ever triggers, then the above is a real possibility.

I triggered the warning exactly as you expected, so we can confirm the
bug is caused by the scenario in your hypothesis.
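For reference, this is roughly how the instrumented hunk looks in my
4.4.36 tree (a sketch; the placement of the extra declaration is my
own, and the surrounding code is elided):

    /* drivers/md/md.c, debug only */
    static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
    {
            struct mddev *mddev = q->queuedata;
            struct bio *tmp = bio;  /* remember the bio we were handed */
            ...
            /*
             * If blk_queue_split() splits the bio, it resubmits the
             * first half via generic_make_request() -- which parks it
             * on the calling task's bio_list_on_stack -- and advances
             * 'bio' to the remainder, so the pointer changes exactly
             * when a split has happened.
             */
            blk_queue_split(q, &bio, q->bio_split);
            WARN_ON_ONCE(bio != tmp);  /* md.c:262 in the trace below */
            ...
    }

The warning it fired: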
[  429.282235] ------------[ cut here ]------------
[  429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]()
[  429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm_infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmouse xhci_pci xhci_hcd [last unloaded: mlx_compat]
[  429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G           O    4.4.36-1-pserver #1
[  429.288825] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
[  429.289113]  0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000
[  429.289538]  ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000
[  429.290157]  0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700
[  429.290594] Call Trace:
[  429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
[  429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
[  429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
[  429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
[  429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
[  429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
[  429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
[  429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
[  429.292011]  [<ffffffffa0834f64>] ? ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
[  429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
[  429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
[  429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
[  429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
[  429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
[  429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
[  429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
[  429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
[  429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
[  429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
[  429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
[  429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
[  429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
[  429.294610] ---[ end trace 25d1cece0e01494b ]---

I double checked: nr_pending on the healthy leg is still 1, as before.

> Fixing the problem isn't very easy...
>
> You could try:
>
> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>    (which you will need to export from block/bio.c), passing
>    mddev->queue->bio_split as the bio_set.
>
> 2/ change the wait_event_lock_irq() call in wait_barrier() to
>    wait_event_lock_irq_cmd(), and pass the new function as the
>    command.
>
> That way, if wait_barrier() ever blocks, all the requests in
> bio_list_on_stack will be handled by a separate thread.
>
> NeilBrown

I will try your suggested way to see if it fixes the bug, and will
report back soon. My rough plan is sketched below my signature.

--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@xxxxxxxxxxxxxxxx
URL:   https://www.profitbricks.de

Registered office (Sitz der Gesellschaft): Berlin
Commercial register (Registergericht): Amtsgericht Charlottenburg, HRB 125506 B
Managing Director (Geschäftsführer): Achim Weiss
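Untested sketch of that plan (raid1_punt_split_bios() is just my
working name for the new helper; punt_bios_to_rescuer() would need to
be made non-static in block/bio.c, with a prototype added to
include/linux/bio.h):

    /* block/bio.c: drop 'static' so raid1 can call it */
    EXPORT_SYMBOL_GPL(punt_bios_to_rescuer);

    /* drivers/md/raid1.c */
    static void raid1_punt_split_bios(struct r1conf *conf)
    {
            /*
             * Hand everything parked on the current task's
             * bio_list_on_stack over to the bio_split rescue
             * workqueue, so those bios can complete (and drop
             * nr_pending) while we sleep in wait_barrier().
             */
            if (current->bio_list && !bio_list_empty(current->bio_list))
                    punt_bios_to_rescuer(conf->mddev->queue->bio_split);
    }

    static void wait_barrier(struct r1conf *conf)
    {
            spin_lock_irq(&conf->resync_lock);
            if (conf->barrier) {
                    conf->nr_waiting++;
                    /*
                     * Was wait_event_lock_irq(); the condition is
                     * unchanged, and the new cmd argument runs with
                     * resync_lock dropped, each time before we sleep.
                     */
                    wait_event_lock_irq_cmd(conf->wait_barrier,
                                            !conf->barrier ||
                                            (conf->nr_pending &&
                                             current->bio_list &&
                                             !bio_list_empty(current->bio_list)),
                                            conf->resync_lock,
                                            raid1_punt_split_bios(conf));
                    conf->nr_waiting--;
            }
            conf->nr_pending++;
            spin_unlock_irq(&conf->resync_lock);
    }

Since wait_event_lock_irq_cmd() executes the cmd after dropping the
lock and before scheduling, punting there should be safe from a
locking point of view, but I will confirm with testing.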