Re: Raid5 hang in 3.14.19

On 09/30/2014 05:54 PM, NeilBrown wrote:
[removed a lot of stuff about the raid5 check hanging]
> Thanks for the testing!  You have included enough information.
> I didn't really like that 'sync_starting' variable when I wrote the patch,
> but it seemed to do the right thing.  It doesn't.
>
> If md_check_recovery() runs again immediately after scheduling the sync
> thread to run, it will not have set sync_starting but will find ->sync_thread
> is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
> flag is still clear and ->sync_thread is not NULL so it will try to stop the
> thread, which deadlocks.
>
> This patch on top of what you have should fix it... but I might end up
> redoing the logic a bit.

Neil,
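
Just so I'm sure I follow the race you describe above, here is the
sequence as I understand it, written out as a rough userspace model
(this is not the md.c code, just the control flow; the names below
only stand in for the real flags and pointers):

#include <stdbool.h>
#include <stdio.h>

static bool recovery_running;  /* stands in for MD_RECOVERY_RUNNING   */
static bool sync_starting;     /* the variable from your first patch  */
static void *sync_thread;      /* stands in for mddev->sync_thread    */

/* Models the decision md_check_recovery() makes on each run. */
static void check_recovery(void)
{
	if (!recovery_running && sync_thread) {
		printf("flag clear but thread exists -> stop it -> deadlock\n");
		return;
	}
	if (recovery_running && !sync_thread && !sync_starting) {
		/* Looks like a failed start, so give up and clear the flag. */
		recovery_running = false;
		printf("cleared MD_RECOVERY_RUNNING before the start ran\n");
	}
}

int main(void)
{
	/* Run 1: decide to start a sync: set MD_RECOVERY_RUNNING and queue
	 * the work that will create the sync thread; the work has not run
	 * yet, and sync_starting is not left set for the next run. */
	recovery_running = true;

	/* Run 2, immediately afterwards: ->sync_thread is still NULL, so
	 * the flag gets cleared. */
	check_recovery();

	/* The queued work now runs and actually creates the sync thread. */
	sync_thread = (void *)1;

	/* Run 3: flag clear, ->sync_thread non-NULL, so md tries to stop a
	 * thread that believes it should keep running. */
	check_recovery();
	return 0;
}

At least that's how I'm reading it.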

With the second patch, my test has been running well for close to 5 days now,
but something odd happened yesterday.

This was on a raid1 array (md1) carrying both ext3 and xfs filesystems on LVM,
but I suspect the same thing happened silently in my raid5 test as well.

md1 : active raid1 sdh3[0] sdg3[1]
      76959296 blocks [2/2] [UU]
      [=>...................] check = 9.7% (7532800/76959296) finish=108.8min speed=10628K/sec
      bitmap: 0/1 pages [0KB], 65536KB chunk

Again, the test is running kernel builds, read/write loops, and remove / re-add / check loops
(roughly as sketched below). At some point while a disk was being removed from the array,
something bad happened: the errors below appeared in the log and the ext3 filesystem was
remounted read-only. xfs plowed right on through. There is no evidence of any read or write
errors on the member disks, either in the logs or on the disks themselves, and the raid1
checks came back with zero mismatches. fsck on the ext3 filesystem complained a lot, but it
came back into a usable state and I've restarted my tests.

In the log below, dm-3 is the xfs volume and dm-1 is the ext3 volume. The interesting part
starts around 12:27:04, immediately after (or during?) the removal of sdh3 from md1.
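
In case it helps, one iteration of the remove / re-add / check loop boils
down to something like this (a simplified sketch with /dev/md1 and /dev/sdh3
hard-coded, written here as a small C driver rather than the actual script;
the 10-second pause is only illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Run one shell command, echoing it first so the console shows what ran. */
static void run(const char *cmd)
{
	printf("+ %s\n", cmd);
	if (system(cmd) != 0)
		fprintf(stderr, "command failed: %s\n", cmd);
}

int main(void)
{
	for (;;) {
		/* Fail and remove one member of the mirror. */
		run("mdadm /dev/md1 --fail /dev/sdh3");
		run("mdadm /dev/md1 --remove /dev/sdh3");
		sleep(10);                       /* illustrative pause */

		/* Add it back and wait for recovery to finish. */
		run("mdadm /dev/md1 --add /dev/sdh3");
		run("mdadm --wait /dev/md1");

		/* Kick off a data check and wait for it to complete. */
		run("sh -c 'echo check > /sys/block/md1/md/sync_action'");
		run("mdadm --wait /dev/md1");
	}
	return 0;
}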

Oct  4 12:26:01 xplane kernel: md: unbind<sdg3>
Oct  4 12:26:01 xplane kernel: md: export_rdev(sdg3)
Oct  4 12:26:11 xplane kernel: md: bind<sdg3>
Oct  4 12:26:11 xplane kernel: RAID1 conf printout:
Oct  4 12:26:11 xplane kernel:  --- wd:1 rd:2
Oct  4 12:26:11 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:26:11 xplane kernel:  disk 1, wo:1, o:1, dev:sdg3
Oct  4 12:26:12 xplane kernel: md: recovery of RAID array md1
Oct  4 12:26:12 xplane kernel: md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
Oct  4 12:26:12 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  4 12:26:12 xplane kernel: md: using 128k window, over a total of 76959296k.
Oct  4 12:27:03 xplane kernel: md: md1: recovery done.
Oct  4 12:27:03 xplane kernel: RAID1 conf printout:
Oct  4 12:27:03 xplane kernel:  --- wd:2 rd:2
Oct  4 12:27:03 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:27:03 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3, disabling device.
Oct  4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1 devices.
Oct  4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331468
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331469
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331470
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331471
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331472
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331473
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331474
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331475
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331476
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical block 6331477
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Aborting journal on device dm-1.
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: ext3_journal_start_sb: Detected aborted journal
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting filesystem read-only
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: ext3_journal_start_sb: Detected aborted journal
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_writeback_write_end: IO failure
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_dirty_inode: IO failure
Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:05 xplane kernel:  disk 0, wo:1, o:0, dev:sdh3
Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:14 xplane kernel: md: unbind<sdh3>
Oct  4 12:27:14 xplane kernel: md: export_rdev(sdh3)
Oct  4 12:27:24 xplane kernel: md: bind<sdh3>
Oct  4 12:27:24 xplane kernel: RAID1 conf printout:
Oct  4 12:27:24 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:24 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
Oct  4 12:27:24 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:24 xplane kernel: md: recovery of RAID array md1
Oct  4 12:27:24 xplane kernel: md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
Oct  4 12:27:24 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  4 12:27:24 xplane kernel: md: using 128k window, over a total of 76959296k.
Oct  4 12:27:33 xplane kernel: md: md1: recovery done.
Oct  4 12:27:33 xplane kernel: RAID1 conf printout:
Oct  4 12:27:33 xplane kernel:  --- wd:2 rd:2
Oct  4 12:27:33 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:27:33 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:35 xplane kernel: md: data-check of RAID array md1
Oct  4 12:27:35 xplane kernel: md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
Oct  4 12:27:35 xplane kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Oct  4 12:27:35 xplane kernel: md: using 128k window, over a total of 76959296k.
Oct  4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing b_committed_data

It seems like something got out of sync as the disk was being removed, but before the remove completed.

Again, this is 3.14.19 with these 8 patches:
      md/raid1: intialise start_next_window for READ case to avoid hang
      md/raid1: be more cautious where we read-balance during resync.
      md/raid1: clean up request counts properly in close_sync()
      md/raid1: make sure resync waits for conflicting writes to complete.
      md/raid1: Don't use next_resync to determine how far resync has progressed
      md/raid1: update next_resync under resync_lock.
      md/raid1: count resync requests in nr_pending.
      md/raid1: fix_read_error should act on all non-faulty devices.

      and the two patches for the check start hang.

Any ideas on what happened here?

Thanks,
Bill

