Re: how to handle bad sectors in md control areas?

NeilBrown <neilb@xxxxxxx> · Mon, 3 Mar 2014 08:38:38 +1100

On Wed, 26 Feb 2014 19:16:30 +1100 Eyal Lebedinsky <eyal@xxxxxxxxxxxxxx>
wrote:

> In another thread I investigated an issue with a pending sector, which now seems to be
> a bad sector inside the md header (the first 256k sectors).
> 
> The question now remaining: what is the correct approach to fixing this problem?

You could "fix" it by simply redefining it not to be a problem.
If you never get an error then is there a problem?

> 
> The more general issue is what to do when any md control area develops an error. does
> all data have redundant copies?

We don't currently have any redundancy within a device.  Of course most
metadata is replicated across all devices so there is redundancy in that
sense.
I have occasionally thought of creating a v1.3 metadata which duplicates the
superblock at both ends of the device.  Never quite seemed worth the effort
though.
The write-intent-bitmap would be a lot more expensive to duplicate but as it
is identical on all devices, the gain would be small (though there are cases
where it would be useful).

The bad-block log probably should be duplicated.  That wouldn't be too
expensive and might have some real benefits....

> 
> The simple way that I see is to fail the member, remove it, clear it (at least
> --zero-superblock and write to the bad sector) and then add it. However this
> will incur a full resync (about 10 hours).
> 
> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
> should allow a create with --assume-clean (which will write to the bad sector and
> "fix" it), but the doco discourages this.

Why do you think this will write the bad sector?
When you --create an array it doesn't write to all the space on the array.
It only writes what it needs to.  So the superblock, the write-intent-bitmap
and maybe the bad-block-log.  But nothing else.
And most of that gets written during normal array activity.
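
To see whether a given sector falls inside one of those areas, a minimal
sketch along these lines might help.  The region offsets and lengths below
are hypothetical placeholders, not values from this array; real ones would
have to come from what `mdadm --examine` reports for the member device.
The names classify_sector and EXAMPLE_REGIONS are made up for illustration.

```python
def classify_sector(sector, regions):
    """Return the name of the region containing `sector`, or None.

    `regions` is a list of (name, start_sector, length_in_sectors) tuples,
    all relative to the start of the member device.
    """
    for name, start, length in regions:
        if start <= sector < start + length:
            return name
    return None


# Hypothetical layout in 512-byte sectors (an assumption for illustration).
EXAMPLE_REGIONS = [
    ("superblock", 8, 8),
    ("write-intent-bitmap", 16, 56),
    ("bad-block-log", 72, 8),
]

if __name__ == "__main__":
    for s in (8, 40, 75, 200000):
        print(s, classify_sector(s, EXAMPLE_REGIONS) or "not in a listed region")
```

A sector before the data offset that matches none of the listed regions is
a candidate for "not used at all", subject to the stop/start/check test.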

So if a block remains unwritten after stop/start/check, you can be fairly sure
it isn't used at all, so you can ignore it.  Or write zeros to it.
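
Writing zeros to one sector could be sketched as below.  This is only an
illustration: zero_sector is a hypothetical helper, the 512-byte sector
size is an assumption (a drive with 4096-byte physical sectors would
likely need the whole physical sector written), and it overwrites whatever
is at that offset, so it only makes sense for a sector already known to be
unused.

```python
import os


def zero_sector(path, sector, sector_size=512):
    """Overwrite one sector of `path` with zeros and flush it to the device.

    `path` may be a block device or a regular file; `sector` is counted
    from the start of `path` in units of `sector_size` bytes.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, bytes(sector_size), sector * sector_size)
        if written != sector_size:
            raise OSError("short write: %d of %d bytes" % (written, sector_size))
        os.fsync(fd)
    finally:
        os.close(fd)
```

Trying it on a scratch file first, rather than on the real member device,
is the safer way to check that the offset arithmetic is what you expect.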

> 
> Also, it is not impossible to think that the specific bad sector (toward the end
> of the header) is not actually used today, meaning I can live with it as is, or
> write anything to the bad sector as it does not get used. Too involved though.
> 
> A bad sector in the data area should be fixed with a standard raid 'check' action.
> 
> TIA
> 

NeilBrown