David and others, I'd like to add that EVMS ( http://evms.sourceforge.net/ )
already has a bad-block management layer - maybe it could be merged into md
to provide write-error management as well :)

Regards.

David Greaves wrote:
> Just for discussion...
>
> Proposal:
> md devices to have a badstripe table and space for re-allocation
>
> Benefits:
> Allows multiple block-level failures on any combination of component md
> devices, provided parity is not compromised.
> Zero impact on performance in non-degraded mode.
> No need for scanning (although it may be used as a trigger).
> Works for all md personalities.
>
> Overview:
> Provide an 'on or off-array' store for any stripes impacted by
> block-level failure.
> Unlike a disk's badblock allocation this would be a temporary store,
> since we'd insist on the underlying devices recovering fully from the
> problem before restoring full health.
> This allows us to cope transiently and, in the event of non-recoverable
> errors, until the disk is replaced.
>
> Downsides:
> Resync'ing with multiple failing drives is more complex (but more
> resilient).
> Some kind of store handler is needed.
>
> Description:
> I've structured this to look at the md driver, the userspace daemon, the
> store, failing drives, and replacing and resync'ing drives.
>
> md:
> For normal md access the badstripe list has no entries and is ignored. A
> badstripe table check is required prior to each stripe access (first
> sketch below).
>
> If a write error occurs, rewrite the stripe to a store, noting, and
> marking bad, the originating (faulty) stripe (and offending
> device/block) in the badstripe table. The device is marked 'failing'.
> If a read error occurs, attempt to reconstruct the stripe from the other
> devices, then follow the write error path.
>
> For normal md access against stripes appearing in the badstripe list:
> * Lock the badstripe table against the daemon (and other md threads)
> * Check the stripe is still in the badstripe list
> * If not, then the userland daemon fixed it. Release the lock. Carry on
> as normal.
> * If so, then read/write from the reserved area.
> * Release the badstripe lock.
>
> Daemon:
> A userland daemon could examine the reserved area, attempt a repair on a
> faulty stripe and, if it succeeds, restore the stripe and mark the
> badstripe entry as clean, thus freeing up the reserved area and
> restoring perfect health.
> The daemon would (second sketch below):
> * lock the badstripe table against md
> * write the stripe back to the previously faulty area, which shouldn't
> need locking against md since it's "not in use"
> * correct the badstripe table
> * release the lock
> If the daemon fails, the badstripe entry is marked as unrecoverable.
>
> If the daemon has failed to correct the error (unrecoverable in the
> badstripe table) then the drive should be kept as failing (not faulty)
> and should be replaced. The intention is to allow a failing drive to
> continue to be used in the event of a subsequent bad drive event.
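>
> To make the md side concrete, here is roughly what I imagine for the
> table and the per-access check. Every name here (struct badstripe,
> badstripe_redirect, MAX_BADSTRIPES, ...) is invented - I haven't checked
> any of this against the real md structures or locking, so treat it as a
> sketch only:
>
>   #include <linux/spinlock.h>
>   #include <linux/types.h>
>
>   #define MAX_BADSTRIPES 64      /* arbitrary; the table should stay small */
>
>   /* One entry per stripe currently living in the reserved area. */
>   struct badstripe {
>           sector_t orig;          /* first sector of the faulty stripe */
>           sector_t store;         /* its slot in the reserved area     */
>           int      disk;          /* offending component device        */
>           int      unrecoverable; /* set when the daemon gives up      */
>   };
>
>   struct badstripe_table {
>           spinlock_t       lock;  /* shared between md and the daemon  */
>           int              nr;    /* 0 on a healthy array              */
>           struct badstripe ent[MAX_BADSTRIPES];
>   };
>
>   /*
>    * Called before each stripe access.  On a healthy array this is a
>    * single integer test, which is why the non-degraded overhead should
>    * be near zero.  Returns the sector md should actually use.
>    */
>   static sector_t badstripe_redirect(struct badstripe_table *t,
>                                      sector_t sector)
>   {
>           sector_t ret = sector;
>           int i;
>
>           if (likely(t->nr == 0))
>                   return sector;                  /* fast path */
>
>           spin_lock(&t->lock);
>           for (i = 0; i < t->nr; i++)
>                   if (t->ent[i].orig == sector) {
>                           ret = t->ent[i].store;  /* use the reserved area */
>                           break;
>                   }
>           /* Not found: the daemon repaired it; carry on as normal. */
>           spin_unlock(&t->lock);
>           return ret;
>   }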
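>
> And the daemon's repair pass, in the same spirit - struct bs_entry and
> all of the badstripe_*()/store_*() calls below are equally hypothetical,
> just stand-ins for whatever interface md would actually expose:
>
>   #define STRIPE_BYTES 65536           /* illustrative stripe size */
>
>   struct bs_entry { long long orig, store; }; /* mirrors the kernel entry */
>
>   /* Hypothetical interface to md and to the store: */
>   extern int  badstripe_get_locked(int md_fd, struct bs_entry *e);
>   extern void badstripe_unlock(int md_fd, struct bs_entry *e);
>   extern void badstripe_clear(int md_fd, struct bs_entry *e);
>   extern void badstripe_mark_unrecoverable(int md_fd, struct bs_entry *e);
>   extern void store_read(long long slot, void *buf, int len);
>   extern int  stripe_rewrite(int md_fd, long long sector,
>                              const void *buf, int len);
>
>   /* One repair pass; returns -1 when there is nothing to repair. */
>   int repair_one(int md_fd)
>   {
>           struct bs_entry e;
>           char buf[STRIPE_BYTES];
>
>           /* 1. lock the badstripe table against md, fetch an entry */
>           if (badstripe_get_locked(md_fd, &e) < 0)
>                   return -1;
>
>           /* read the saved copy out of the reserved area */
>           store_read(e.store, buf, sizeof(buf));
>
>           /* 2. write it back to the previously faulty stripe; a
>            * successful write lets the disk reallocate the bad block */
>           if (stripe_rewrite(md_fd, e.orig, buf, sizeof(buf)) == 0)
>                   badstripe_clear(md_fd, &e);    /* 3. correct the table */
>           else
>                   badstripe_mark_unrecoverable(md_fd, &e);
>
>           /* 4. release the lock */
>           badstripe_unlock(md_fd, &e);
>           return 0;
>   }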
>
> The Store:
> This could be reserved stripes at the start (?) of the component
> devices, read/written using the current personality. Alternatively it
> could be a filesystem-level store (possibly remote, on a resilient
> device, or just in /tmp).
>
> Failing drives:
> From a reading point of view it seems possible to treat a failing drive
> as a faulty drive - until the event of another read failure on another
> drive. In that case the read error case above could still access the
> failing drive to attempt a recovery. This may help in the event of
> recovery from a failing drive, where you want to minimise load against
> it. It may not be worthwhile.
> Writing would still have to continue, to maintain sync.
>
> Drive replacement + resync:
> If multiple devices go 'failing', how are they removed (since they are
> all in use)? A spare needs to be added, and then the resync code needs
> to ensure that one of the failing disks is synced to the spare. Then the
> failing disk is made faulty and removed.
>
> This could be done by having a progression:
> failing > failing-pending-remove > faulty
>
> As I said above, a failing drive is not used for reads, only for writes.
> Presumably a drive that is sync'ing is used for writes but not reads.
> So if we add a good drive, mark it syncing, and simultaneously mark the
> drive it replaces failing-pending-remove, then the f-p-r drive won't be
> written to but is available for essential reads until the new drive is
> ready.
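>
> As a sketch (names invented again), that progression is just one more
> per-device state, plus rules about which states take reads and writes:
>
>   /* Hypothetical per-device states for this scheme. */
>   enum disk_state {
>           DISK_ACTIVE,    /* in sync: normal reads and writes        */
>           DISK_SYNCING,   /* new spare: written to, never read       */
>           DISK_FAILING,   /* had a bad block: still written to keep
>                              it in sync; reads come from redundancy  */
>           DISK_FAILING_PENDING_REMOVE,
>                           /* spare is syncing: no more writes; kept
>                              only for essential (recovery) reads     */
>           DISK_FAULTY,    /* out of the array                        */
>   };
>
>   static int takes_writes(enum disk_state s)
>   {
>           return s == DISK_ACTIVE || s == DISK_SYNCING ||
>                  s == DISK_FAILING;
>   }
>
>   static int takes_normal_reads(enum disk_state s)
>   {
>           return s == DISK_ACTIVE;
>   }
>
>   /* last-resort reconstruction reads, once redundancy is gone */
>   static int takes_recovery_reads(enum disk_state s)
>   {
>           return s == DISK_ACTIVE || s == DISK_FAILING ||
>                  s == DISK_FAILING_PENDING_REMOVE;
>   }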
>
> Some thoughts:
> How much overhead is involved in checking each stripe read/write address
> against a *small* bad-stripe table? Probably none, because most of the
> time, for a healthy md, the number of entries is 0 (the fast path in the
> first sketch above).
>
> Does the temporary space even have to be in the md space? Would it be
> easier to make it a file (not in the filesystem on the md device!!)?
> This avoids any messing with stripe offsets etc.
>
> I don't claim to understand md's locking - the stuff above is a
> simplistic start on the additional locking related to moving stuff in
> and out of the badstripes area. I don't know where contention is handled
> - md driver or fs.
>
> This is essentially only useful for single (or at least 'few') badblock
> errors - is that a problem worth solving? (From the thread title I
> assume so.)
>
> How intrusive is this? I can't really judge. It mainly feels like error
> handling - and maybe handing off to a reused/simplified loopback-like
> device could handle 'hits' against the reserved area.
>
> I'm only starting to read the code/device-driver books etc. etc., so if
> I'm talking rubbish then I'll apologise for your time and keep quiet :)
>
> David
>
> Guy wrote:
>
> >Neil said:
> >"I hadn't thought about that yet. I suspect there would be little
> >point in doing a scan when there was no redundancy. However a scan on
> >a degraded raid6 that could still safely lose one drive would
> >probably make sense."
> >
> >I agree.
> >
> >Also a RAID1 with 2 or more working devices. Don't forget, some people
> >have 3 or more devices in their RAID1 arrays. From what I have read,
> >anyway.
> >
> >Thanks,
> >Guy
> >
> >-----Original Message-----
> >From: Neil Brown [mailto:neilb@xxxxxxxxxxxxxxx]
> >Sent: Tuesday, November 16, 2004 6:04 PM
> >To: Guy
> >Cc: linux-raid@xxxxxxxxxxxxxxx
> >Subject: RE: Bad blocks are killing us!
> >
> >On Tuesday November 16, bugzilla@xxxxxxxxxxxxxxxx wrote:
> >
> >>This sounds great!
> >>
> >>But...
> >>
> >>2/ Do you intend to create a user space program to attempt to correct
> >>the bad block and put the device back in the array automatically? I
> >>hope so.
> >
> >Definitely. It would be added to the functionality of "mdadm --monitor".
> >
> >>If not, please consider correcting the bad block without kicking the
> >>device out. Reason: Once the device is kicked out, a second bad block
> >>on another device is fatal to the array. And this has been happening a
> >>lot lately.
> >
> >This is one of several things that make it "a bit less trivial" than
> >simply using the bitmap stuff. I will keep your comment in mind when
> >I start looking at this in more detail. Thanks.
> >
> >>3/ Maybe don't do the bad block scan if the array is degraded. Reason:
> >>If a bad block is found, that would kick out a second disk, which is
> >>fatal. Since the stated purpose of this is to "check parity/copies are
> >>correct" then you probably can't do this anyway. I just want to be
> >>sure. Also, if during the scan a device is kicked, the scan should
> >>pause or abort. The scan can resume once the array has been corrected.
> >>I would be happy if the scan had to be restarted from the start. So a
> >>pause or abort is fine with me.
> >
> >I hadn't thought about that yet. I suspect there would be little
> >point in doing a scan when there was no redundancy. However a scan on
> >a degraded raid6 that could still safely lose one drive would
> >probably make sense.
> >
> >NeilBrown

--
md2520@xxxxxxxxx
Team OS/2 Italia