Re: [PATCH] MD: Quickly return errors if too many devices have failed.

On Wed, 13 Mar 2013 12:29:24 -0500 Jonathan Brassow <jbrassow@xxxxxxxxxx>
wrote:

> Neil,
> 
> I've noticed that when too many devices fail in a RAID array,
> additional I/O will hang, yielding an endless supply of:
> Mar 12 11:52:53 bp-01 kernel: Buffer I/O error on device md1, logical block 3
> Mar 12 11:52:53 bp-01 kernel: lost page write due to I/O error on md1
> Mar 12 11:52:53 bp-01 kernel: sector=800 i=3           (null)           (null)           (null)           (null) 1

This is the third report in as many weeks that mentions that WARN_ON.
The first two had quite different causes.
I think this one is the same as the first one, which means it would be fixed
by  
      md/raid5: schedule_construction should abort if nothing to do.

which is commit 29d90fa2adbdd9f in linux-next.

> Mar 12 11:52:53 bp-01 kernel: ------------[ cut here ]------------
> Mar 12 11:52:53 bp-01 kernel: WARNING: at drivers/md/raid5.c:354 init_stripe+0x2d4/0x370 [raid456]()

> 
> Are other people seeing this, or is this an artifact of the way I am killing
> devices ('echo offline > /sys/block/$dev/device/state')?

That is a perfectly good way to kill a device.

> 
> I would prefer to get immediate errors if nothing can be done to satisfy the
> request and I've been thinking of something like the attached patch.  The
> patch below is incomplete.  It does not take into account any reshaping that
> is going on, nor does it try to figure out if a mirror set in RAID10 has died;
> but I hope it gets the basic idea across.
> 
> Is this a good way to handle this situation, or am I missing something?

I think we do get immediate errors (once all bugs are fixed).
Your patch does extra work for every request which is only of value if the
array has failed - and it really doesn't make sense to optimise for a failed
array.
The current approach is to just try to satisfy a request and, once we find
that we need to do something that is impossible, return an error at that
point.  I think that is best.
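
To illustrate the trade-off, here is a minimal user-space sketch. It is a toy
model, not the md/raid5 code: struct toy_array, toy_request_eager() and
toy_request_lazy() are hypothetical names invented for this example. The
eager variant scans device state on every request, while the lazy variant
pays that cost only when a request actually touches a failed member.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NDEVS 4

/* Hypothetical toy model; these are not the md data structures. */
struct toy_array {
	bool failed[TOY_NDEVS];	/* true if member device i is dead */
	int max_degraded;	/* failures the layout tolerates (1 for a RAID5-like set) */
};

/* Count the failed members of the toy array. */
static int toy_count_failed(const struct toy_array *a)
{
	int nfailed = 0;

	for (size_t i = 0; i < TOY_NDEVS; i++)
		if (a->failed[i])
			nfailed++;
	return nfailed;
}

/* Eager: check array health up front, on every request, healthy or not. */
static int toy_request_eager(const struct toy_array *a, int dev)
{
	(void)dev;	/* the up-front check does not depend on the target */
	if (toy_count_failed(a) > a->max_degraded)
		return -EIO;
	return 0;
}

/* Lazy: just try the request; fail only where it becomes impossible. */
static int toy_request_lazy(const struct toy_array *a, int dev)
{
	if (!a->failed[dev])
		return 0;	/* fast path: no extra work on a healthy member */

	/* Slow path: the target is dead, so we would need to reconstruct. */
	if (toy_count_failed(a) > a->max_degraded)
		return -EIO;	/* too many failures: reconstruction impossible */
	return 0;
}
```

In this toy model both variants return -EIO once a request needs data that
can no longer be reconstructed, but only the eager one does extra work on the
healthy path.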

Can you try the commit I identified and see if it makes the problem go away?

Thanks,
NeilBrown

