Re: [PATCH] MD: Quickly return errors if too many devices have failed.

NeilBrown <neilb@xxxxxxx> · Mon, 18 Mar 2013 10:49:05 +1100

On Wed, 13 Mar 2013 12:29:24 -0500 Jonathan Brassow <jbrassow@xxxxxxxxxx>
wrote:

> Neil,
> 
> I've noticed that when too many devices fail in a RAID array that
> additional I/O will hang, yielding an endless supply of:
> Mar 12 11:52:53 bp-01 kernel: Buffer I/O error on device md1, logical block 3
> Mar 12 11:52:53 bp-01 kernel: lost page write due to I/O error on md1
> Mar 12 11:52:53 bp-01 kernel: sector=800 i=3           (null)           (null)           (null)           (null) 1

This is the third report in as many weeks that mentions that WARN_ON.
The first two had quite different causes.
I think this one is the same as the first one, which means it would be fixed
by  
      md/raid5: schedule_construction should abort if nothing to do.

which is commit 29d90fa2adbdd9f in linux-next.

> Mar 12 11:52:53 bp-01 kernel: ------------[ cut here ]------------
> Mar 12 11:52:53 bp-01 kernel: WARNING: at drivers/md/raid5.c:354 init_stripe+0x2d4/0x370 [raid456]()

> 
> Are other people seeing this, or is this an artifact of the way I am killing
> devices ('echo offline > /sys/block/$dev/device/state')?

That is a perfectly good way to kill a device.

> 
> I would prefer to get immediate errors if nothing can be done to satisfy the
> request and I've been thinking of something like the attached patch.  The
> patch below is incomplete.  It does not take into account any reshaping that
> is going on, nor does it try to figure out if a mirror set in RAID10 has died;
> but I hope it gets the basic idea across.
> 
> Is this a good way to handle this situation, or am I missing something?

I think we do get immediate errors (once all bugs are fixed).
Your patch does extra work for every request, and that work is only of value
if the array has failed. It really doesn't make sense to optimise for a failed
array.
The current approach is to just try to satisfy a request and once we find
that we need to do something that is impossible - return an error at that
point.  I think that is best.
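
To illustrate the trade-off, here is a minimal user-space sketch. It is not md
code: struct toy_array, count_failed(), submit_eager() and submit_lazy() are
all hypothetical names, and the model only covers a read aimed at a single
member device. The eager variant scans every device on every request; the lazy
variant only pays that cost after it has already hit a dead device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of an array; these are not the md data structures. */
struct toy_array {
    size_t ndevs;         /* total member devices */
    size_t max_degraded;  /* failures the layout tolerates (1 for RAID5) */
    const bool *failed;   /* failed[i] is true if device i is dead */
};

/* Count failed members: the work an up-front check repeats per request. */
static size_t count_failed(const struct toy_array *a)
{
    size_t n = 0;

    for (size_t i = 0; i < a->ndevs; i++)
        if (a->failed[i])
            n++;
    return n;
}

/* Eager variant: scan all devices first, even when the array is healthy. */
static int submit_eager(const struct toy_array *a, size_t dev)
{
    assert(dev < a->ndevs);
    if (count_failed(a) > a->max_degraded)
        return -EIO;
    return 0;  /* read directly, or reconstruct from the survivors */
}

/* Lazy variant: just try, and fail only where the work becomes impossible. */
static int submit_lazy(const struct toy_array *a, size_t dev)
{
    assert(dev < a->ndevs);
    if (!a->failed[dev])
        return 0;  /* fast path: the target device is fine */

    /* Slow path, reached only after something has already gone wrong. */
    if (count_failed(a) > a->max_degraded)
        return -EIO;  /* too few survivors to reconstruct */
    return 0;  /* reconstruct from the surviving devices */
}
```

In this toy model the lazy path still returns -EIO as soon as a request needs
a block that cannot be reconstructed, but a healthy array never runs the scan.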

Can you try the commit I identified and see if it makes the problem go away?

Thanks,
NeilBrown
