Re: [PATCH] DM RAID: Add ability to restore transiently failed devices on resume

NeilBrown <neilb@xxxxxxx> · Mon, 22 Apr 2013 10:43:26 +1000

On Thu, 11 Apr 2013 15:27:03 -0500 Jonathan Brassow <jbrassow@xxxxxxxxxx>
wrote:

> DM RAID: Add ability to restore transiently failed devices on resume
> 
> This patch adds code to the resume function to check over the devices
> in the RAID array.  If any are found to be marked as failed and their
> superblocks can be read, an attempt is made to reintegrate them into
> the array.  This allows the user to refresh the array with a simple
> suspend and resume of the array - rather than having to load a
> completely new table, allocate and initialize all the structures and
> throw away the old instantiation.
> 
> Signed-off-by: Jonathan Brassow <jbrassow@xxxxxxxxxx>

Is this really a good idea?  
Just because you can read the superblock, that doesn't mean you can read any
other block in the array.

If the auto-recovery were optional I might consider it, but having it be
enforced on every suspend/resume just doesn't seem safe to me.

Have I misunderstood?

Is this that hard to achieve in user-space?

> +			if (test_bit(Faulty, &r->flags) && r->sb_page &&
> +			    !read_disk_sb(r, r->sb_size)) {

I know it is fairly widely used, especially for 'strcmp', but I absolutely
despise this construct.
  !read_disk_sb()
sound like
  "not read_disk_sb" or "could not read the disk's superblock" but it
actually means "did not fail to read the disk's superblock" - the exact
reverse.
If 0 means success, I like to see an explicit test for 0:
   read_disk_sb() == 0

NeilBrown

Attachment:
signature.asc

Description: PGP signature