Re: [BUG] MD/RAID1 hung forever on freeze_array

On Wed, Dec 14, 2016 at 1:13 PM, Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
> On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang
> <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
>> Thanks Neil,
>>
>> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>>
>>>>
>>>> As you suggested, I re-ran the same test with 4.4.36 without any of our own patches on MD.
>>>> I can still reproduce the same bug; nr_pending on the healthy leg (loop1) is still 1.
>>>>
>>>
>>> Thanks.
>>>
>>> I have a hypothesis.
>>>
>>> md_make_request() calls blk_queue_split().
>>> If that does split the request, it will call generic_make_request()
>>> on the first half. That will call back into md_make_request() and
>>> raid1_make_request(), which will submit requests to the underlying
>>> devices.  These will get caught on the bio_list_on_stack queue in
>>> generic_make_request(), a queue which is not accounted for in nr_queued.
>>>
>>> When blk_queue_split() completes, 'bio' will be the second half of the
>>> bio.
>>> This enters raid1_make_request(), and by this time the array has been
>>> frozen.
>>> So wait_barrier() has to wait for pending requests to complete, and that
>>> includes the one that is stuck in bio_list_on_stack, which will never
>>> complete now.
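
(Side note, just to be sure I follow the flow you describe: the sketch below
is how I read the 4.4 path, heavily simplified, not the exact source.)

static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
{
        struct mddev *mddev = q->queuedata;

        /*
         * blk_queue_split() may split 'bio'.  One half is resubmitted via
         * generic_make_request(); since we are already inside
         * generic_make_request(), that half only gets appended to
         * current->bio_list (bio_list_on_stack) and is not reflected in
         * nr_queued.
         */
        blk_queue_split(q, &bio, q->bio_split);

        /*
         * 'bio' is now the other half.  raid1_make_request() ->
         * wait_barrier() can block on it if the array got frozen in the
         * meantime, waiting for the half still parked on bio_list_on_stack,
         * which will never be dispatched: deadlock.
         */
        mddev->pers->make_request(mddev, bio);

        return BLK_QC_T_NONE;
}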
>>>
>>> To see if this might be happening, please change the
>>>
>>>         blk_queue_split(q, &bio, q->bio_split);
>>>
>>> call in md_make_request() to
>>>
>>>         struct bio *tmp = bio;
>>>         blk_queue_split(q, &bio, q->bio_split);
>>>         WARN_ON_ONCE(bio != tmp);
>>>
>>> If that ever triggers, then the above is a real possibility.
>>
>> I triggered the warning as you expected, so we can confirm the bug is
>> caused by the scenario in your hypothesis above.
>> [  429.282235] ------------[ cut here ]------------
>> [  429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]()
>> [  429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm_infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmouse xhci_pci xhci_hcd [last unloaded: mlx_compat]
>> [  429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G           O    4.4.36-1-pserver #1
>> [  429.288825] Hardware name: To be filled by O.E.M. To be filled by
>> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
>> [  429.289113]  0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000
>> [  429.289538]  ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000
>> [  429.290157]  0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700
>> [  429.290594] Call Trace:
>> [  429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
>> [  429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
>> [  429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
>> [  429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
>> [  429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
>> [  429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
>> [  429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
>> [  429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
>> [  429.292011]  [<ffffffffa0834f64>] ? ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
>> [  429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
>> [  429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
>> [  429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
>> [  429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
>> [  429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
>> [  429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
>> [  429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
>> [  429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
>> [  429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
>> [  429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
>> [  429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
>> [  429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
>> [  429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [  429.294610] ---[ end trace 25d1cece0e01494b ]---
>>
>> I double-checked: nr_pending on the healthy leg is still 1, as before.
>>
>>>
>>> Fixing the problem isn't very easy...
>>>
>>> You could try:
>>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>>>   (which you will need to export from block/bio.c),
>>>   passing mddev->queue->bio_split as the bio_set.
>>>
>>> 2/ change the wait_event_lock_irq() call in wait_barrier() to
>>>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>>>
>>> That way, if wait_barrier() ever blocks, all the requests in
>>> bio_list_on_stack will be handled by a separate thread.
>>>
>>> NeilBrown
>>
>> I will try the way you suggested to see if it fixes the bug, and will report back soon.
>>

Hi Neil,

I found an old mail thread:
http://www.spinics.net/lists/raid/msg52792.html

It looks like Alex was trying to fix the same bug, right?
In one reply you suggested modifying the call in make_request:

@@ -1207,7 +1207,8 @@ read_again:
                                sectors_handled;
                        goto read_again;
                } else
-                       generic_make_request(read_bio);
+                       reschedule_retry(r1_bio);
                return;
        }


I applied the above change, and it seems to fix the bug: I have run the
same tests for over an hour with no hung tasks anymore.

Do you think this is the right fix? Do we still need the change you
suggested with punt_bios_to_rescuer?
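
Just to make sure I understood the punt_bios_to_rescuer suggestion
correctly, I imagine it would look roughly like the sketch below (untested,
the helper name is mine, and the existing wait condition is left out):

/* block/bio.c: make the existing helper non-static, export it, and add a
 * declaration to include/linux/bio.h */
EXPORT_SYMBOL_GPL(punt_bios_to_rescuer);

/* drivers/md/raid1.c */
static void raid1_punt_split_bios(struct r1conf *conf)
{
        /*
         * Hand the bios parked on current->bio_list (bio_list_on_stack)
         * over to the rescuer thread of the bio_set used by
         * blk_queue_split(), so wait_barrier() cannot deadlock behind them.
         */
        punt_bios_to_rescuer(conf->mddev->queue->bio_split);
}

and in wait_barrier():

-               wait_event_lock_irq(conf->wait_barrier,
-                                   /* existing condition unchanged */,
-                                   conf->resync_lock);
+               wait_event_lock_irq_cmd(conf->wait_barrier,
+                                       /* existing condition unchanged */,
+                                       conf->resync_lock,
+                                       raid1_punt_split_bios(conf));

Is that what you had in mind?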

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@xxxxxxxxxxxxxxxx
URL:      https://www.profitbricks.de

Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing Director: Achim Weiss



