Re: Raid5 hang in 3.14.19

On Sun, 05 Oct 2014 11:05:08 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
wrote:

> On 09/30/2014 05:54 PM, NeilBrown wrote:
> > [removed a lot of stuff about raid5 check hanging]
> > Thanks for the testing!  You have included enough information.
> > I didn't really like that 'sync_starting' variable when I wrote the patch,
> > but it seemed to do the right thing.  It doesn't.
> >
> > If md_check_recovery() runs again immediately after scheduling the sync
> > thread to run, it will not have set sync_starting but will find ->sync_thread
> > is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
> > flag is still clear and ->sync_thread is not NULL so it will try to stop the
> > thread, which deadlocks.
> >
> > This patch on top of what you have should fix it... but I might end up
> > redoing the logic a bit.
> >
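To make the sequence above concrete, here is a minimal user-space
sketch of the three back-to-back md_check_recovery() runs being
described.  recovery_running and sync_thread are illustrative
stand-ins for MD_RECOVERY_RUNNING and mddev->sync_thread; this is
not the actual md.c code.

/* Sketch of the race: three consecutive md_check_recovery() runs. */
#include <stdbool.h>
#include <stdio.h>

static bool recovery_running = true;  /* stand-in for MD_RECOVERY_RUNNING */
static void *sync_thread;             /* NULL until the worker registers  */

static void check_recovery(bool sync_starting)
{
	if (!sync_starting && sync_thread == NULL && recovery_running) {
		recovery_running = false;  /* cleared although a thread is coming */
		printf("run 2: cleared MD_RECOVERY_RUNNING too early\n");
	} else if (!recovery_running && sync_thread != NULL) {
		printf("run 3: stopping a half-started thread -> deadlock\n");
	}
}

int main(void)
{
	check_recovery(true);      /* run 1: schedules the sync thread      */
	check_recovery(false);     /* run 2: worker hasn't registered yet   */
	sync_thread = (void *)1;   /* the worker finally sets ->sync_thread */
	check_recovery(false);     /* run 3: flag clear, thread present     */
	return 0;
}

Run 2 clears the flag because neither sync_starting nor sync_thread is
visible yet; run 3 then finds a thread with MD_RECOVERY_RUNNING clear
and tries to stop it while it is still starting up.
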
> Neil,
> 
> With the second patch, my test has been running well for close to 5 days 
> now,

That is good news - thanks.

> but something odd happened yesterday.
> 
> It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I
> suspect it happened silently on my raid5 test also.
> 
> md1 : active raid1 sdh3[0] sdg3[1]
>        76959296 blocks [2/2] [UU]
>        [=>...................]  check =  9.7% (7532800/76959296) finish=108.8min speed=10628K/sec
>        bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> 
> 
> Again, this is running kernel builds, read/write loops, and
> remove/(re)add/check loops.
> At some point while removing a disk from the array, something bad
> happened: the errors below appeared in the log, and the ext3
> filesystem remounted read-only. xfs plowed right on through.
> There's no evidence of any read or write errors to the member disks,
> either in logs or on the disks themselves, and the raid1 checks came
> back with zero mismatches. fsck for the ext3 fs complained a lot, but
> the filesystem came back into a usable state and I've restarted my
> tests.
> Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around
> 12:27:04. This was immediately after (or during?) removing sdh3 from
> md1.
> 
> Oct  4 12:26:01 xplane kernel: md: unbind<sdg3>
> Oct  4 12:26:01 xplane kernel: md: export_rdev(sdg3)
> Oct  4 12:26:11 xplane kernel: md: bind<sdg3>

Removed and re-added sdg3


> Oct  4 12:26:11 xplane kernel: RAID1 conf printout:
> Oct  4 12:26:11 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:26:11 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:26:11 xplane kernel:  disk 1, wo:1, o:1, dev:sdg3

->add_disk() included sdg3 in the array, not yet in-sync (hence "wo:1" above).

> Oct  4 12:26:12 xplane kernel: md: recovery of RAID array md1
> Oct  4 12:26:12 xplane kernel: md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> Oct  4 12:26:12 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Oct  4 12:26:12 xplane kernel: md: using 128k window, over a total of 76959296k.

Recovery starts


> Oct  4 12:27:03 xplane kernel: md: md1: recovery done.

Recovery finishes after 51 seconds.

> Oct  4 12:27:03 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:03 xplane kernel:  --- wd:2 rd:2
> Oct  4 12:27:03 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:27:03 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

->spare_active() marks sdg3 as fully in-sync (no longer "wo" - write-only).

> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3, disabling device.

sdh3 fails a second later.  Presumably this was requested by your testing
harness.


> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1 devices.

raid1.c:error() confirms the device will now be ignored.


> Oct  4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331468
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3

Write errors start happening.

You should only get a write error if no writes successfully completed to
in_sync, non-faulty devices.
It is possible that the write to sdg3 completed before it was marked in-sync,
and the write to sdh3 completed after it was marked as faulty.
How long after recovery completes do you fail the next device?
The logs suggest it is in the next second, which could be anywhere from
1 msec to 1998 msec.
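
Here is a minimal user-space sketch of that window, with illustrative
names rather than the real raid1.c completion path: the array-level
write succeeds only if at least one completion happened on a member
that was both in-sync and non-faulty at completion time.

/* Sketch of the window: both physical writes land, neither counts. */
#include <stdbool.h>
#include <stdio.h>

struct member {
	const char *name;
	bool in_sync;  /* set by ->spare_active() after recovery */
	bool faulty;   /* set by error() when the disk fails     */
};

/* Does this completion make the array-level write succeed? */
static bool write_counts(const struct member *m)
{
	return m->in_sync && !m->faulty;
}

int main(void)
{
	struct member sdg3 = { "sdg3", false, false };  /* still recovering */
	struct member sdh3 = { "sdh3", true,  false };
	bool ok = false;

	ok |= write_counts(&sdg3);  /* completes BEFORE ->spare_active() */
	sdg3.in_sync = true;        /* 12:27:03: sdg3 marked in-sync     */
	sdh3.faulty  = true;        /* 12:27:04: sdh3 marked faulty      */
	ok |= write_counts(&sdh3);  /* completes AFTER the failure       */

	if (!ok)
		printf("no trusted copy recorded: the bio fails and the\n"
		       "filesystem sees a Buffer I/O error\n");
	return 0;
}

If both completions fall inside that window, md has no device it can
trust the data to be on, so it must report a write error even though
both physical writes actually hit the disks.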


NeilBrown



> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331469
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331470
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331471
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331472
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331473
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331474
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331475
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331476
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331477
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Aborting journal on device dm-1.
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: ext3_journal_start_sb: Detected aborted journal
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting filesystem read-only
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: ext3_journal_start_sb: Detected aborted journal
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_writeback_write_end: IO failure
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_dirty_inode: IO failure
> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:05 xplane kernel:  disk 0, wo:1, o:0, dev:sdh3
> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

raid1_remove_disk() is called to disconnect the device from the array.


> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

which it has now done.

> Oct  4 12:27:14 xplane kernel: md: unbind<sdh3>
> Oct  4 12:27:14 xplane kernel: md: export_rdev(sdh3)
> Oct  4 12:27:24 xplane kernel: md: bind<sdh3>
> Oct  4 12:27:24 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:24 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:24 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
> Oct  4 12:27:24 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> Oct  4 12:27:24 xplane kernel: md: recovery of RAID array md1
> Oct  4 12:27:24 xplane kernel: md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> Oct  4 12:27:24 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Oct  4 12:27:24 xplane kernel: md: using 128k window, over a total of 76959296k.
> Oct  4 12:27:33 xplane kernel: md: md1: recovery done.
> Oct  4 12:27:33 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:33 xplane kernel:  --- wd:2 rd:2
> Oct  4 12:27:33 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:27:33 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> Oct  4 12:27:35 xplane kernel: md: data-check of RAID array md1
> Oct  4 12:27:35 xplane kernel: md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> Oct  4 12:27:35 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> Oct  4 12:27:35 xplane kernel: md: using 128k window, over a total of 76959296k.
> Oct  4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing b_committed_data
> 
> It seems like something got out of sync as the disk was being removed, 
> but before the remove completed.
> 
> Again, this is 3.14.19 with these 8 patches:
>        md/raid1: intialise start_next_window for READ case to avoid hang
>        md/raid1:  be more cautious where we read-balance during resync.
>        md/raid1: clean up request counts properly in close_sync()
>        md/raid1: make sure resync waits for conflicting writes to complete.
>        md/raid1: Don't use next_resync to determine how far resync has progressed
>        md/raid1: update next_resync under resync_lock.
>        md/raid1: count resync requests in nr_pending.
>        md/raid1: fix_read_error should act on all non-faulty devices.
> 
>        and the two patches for the check start hang.
> 
> Any ideas on what happened here?
> 
> Thanks,
> Bill
> 
