Symptom: A RAID 1 array ends up with two threads (flush and raid1) stuck in D state forever. The array is inaccessible and the host must be restarted to restore access to it. I have some scripted workloads that reproduce this within a couple of hours at most on kernels from 3.6.11 through 3.8-rc7. I cannot reproduce it on 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the stacks are different from this bug (since 3.5 is EOL that is perhaps only of interest for bisecting the problem).

Details:

[flush-9:16]
[<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
[<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
[<ffffffff813653c3>] md_make_request+0xd3/0x200
[<ffffffff811f494a>] generic_make_request+0xca/0x100
[<ffffffff811f49f9>] submit_bio+0x79/0x160
[<ffffffff811808f8>] submit_bh+0x128/0x200
[<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
[<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
[<ffffffff81183255>] block_write_full_page+0x15/0x20
[<ffffffff81187908>] blkdev_writepage+0x18/0x20
[<ffffffff810f73b7>] __writepage+0x17/0x40
[<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
[<ffffffff810f8881>] generic_writepages+0x51/0x80
[<ffffffff810f88d0>] do_writepages+0x20/0x40
[<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
[<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
[<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
[<ffffffff8117b171>] wb_do_writeback+0x241/0x250
[<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

[md16-raid1]
[<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
[<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
[<ffffffff813675a7>] md_thread+0x117/0x150
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

Both processes end up in wait_event_lock_irq() waiting for favorable conditions in the struct r1conf before they can proceed, and those conditions evidently never arrive. As far as I can tell from raid1.c, freeze_array() is waiting for nr_pending to reach nr_queued+1 while wait_barrier() is waiting for conf->barrier to drop back to zero; in the hung state the counters never change, so neither wait ever completes. I placed printk statements in freeze_array() and wait_barrier() directly before their respective wait_event_lock_irq() calls, and this is an example of the output:

Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

The flush appears to come in after the attempt to handle the read error. I believe the incrementing nr_waiting comes from the multiple read processes I have running as they pile up behind this deadlock. majienpeng may have been referring to this condition on the linux-raid list a few days ago (http://www.spinics.net/lists/raid/msg42150.html) when he stated: "Above code is what's you said. But it met read-error, raid1d will blocked for ever. The reason is freeze_array"

Environment:
- A RAID 1 array built on top of two multipath maps
- The devices underlying the maps are iSCSI disks exported from a Linux SAN
- Multipath is configured to queue I/O if no path is available, for 5 retries (polling interval 5), before failing and letting the mirror degrade (see the excerpt below)
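For reference, the queueing behaviour above corresponds to multipath settings along these lines. This is a minimal excerpt rather than the full configuration, and the equivalent options may be set in a devices section instead of defaults:

    defaults {
        polling_interval 5
        no_path_retry    5
    }

With no_path_retry 5 the map queues I/O for five path-checker intervals after losing all paths and then starts failing it, which is when md marks the member faulty and the mirror degrades.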
Reproducible on kernels 3.6.11 and up (x86_64).

To reproduce, I create 10 RAID 1 arrays on a client, built on multipath over iSCSI. For each of the arrays I start the following to create some writes:

    NODE=array_5; while :; do TIME=$((RANDOM % 10)); size=$((RANDOM * 6)); sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done

I then start 10 processes, each doing this to create some reads:

    while :; do SAN=$((RANDOM % 10)); dd if=/dev/md/array_$SAN of=/dev/null bs=1024 count=50000; done

In a screen session I run a script in a loop that monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on the failed disk (a rough sketch is in the P.S. below). This is mainly so that I get more than one crack at reproducing the bug, since the timing never happens just right on one try.

Now, on one of the iSCSI servers, I start a loop that sleeps a random amount of time between 1 and 5 minutes, stops the export, sleeps again, and then restores the export.

The net effect of all this is that the disks on the initiator queue I/O for a while when the export is off and eventually fail. In most cases this happens gracefully and I/O goes to the remaining disk. In the failure case we get the behavior described here. The test is set up to restore the arrays and try again until the bug is hit.

We do see this situation in production with some frequency, so it is not something that happens only in theory under artificial conditions. Any help or insight here is much appreciated.

Tregaron Bayly
Systems Administrator
Bluehost, Inc.
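P.S. The fail/remove/re-add monitor mentioned above is essentially the loop below. This is a simplified sketch rather than the exact script (the real one has more error handling), and the mdadm --detail parsing is approximate; the array names are the ones from the test setup above:

    while :; do
        for md in /dev/md/array_{0..9}; do
            # re-add any member mdadm reports as faulty so the test can keep going
            for dev in $(mdadm --detail "$md" 2>/dev/null | awk '/faulty/ {print $NF}'); do
                mdadm "$md" --fail "$dev" --remove "$dev"
                mdadm "$md" --re-add "$dev"
            done
        done
        sleep 5
    done

The export stop/start on the iSCSI server side is just a similar sleep/stop/sleep/start loop around the target's own management tool, so I have not sketched it here.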