Re: safe segmenting of conflicting changes

On 04/27/2010 02:04 PM, Phillip Susi wrote:
> On 4/27/2010 1:27 PM, Doug Ledford wrote:
>> And the state "removed" is labeled as such because the device has been
>> removed from the list of slave devices that the kernel keeps.  Nothing
>> more.  You are reading into it things that weren't intended.  What you
>> are reading into it might even be a reasonable interpretation, but it's
>> not the actual interpretation.
> 
> Then it is a useless and non-existent state.

Actually, that's more true than you realize.  Technically, there is no
"removed" state (at least not that's stored on the removed drive).
There are working, failed, and removed discs as far as the slot array
for the md device is concerned.  The usage of REMOVED in the device
states is limited to marking slots in the constituent device list that
have neither a working disc nor a failed disc associated with them.  It
has nothing to do with whether or not mdadm --remove was ever called.

>  Whether the kernel
> currently has the disk open or not is purely ephemeral; there is no
> reason to bother recording it in the superblock.

We do record it, but mainly in the form of our disk count.  We keep
track of active discs, failed discs, and spare discs.  Whether we have
the disc open and in our device list changes what our failed disc count
would be.  On removal, we decrement that failed disc count.  The slot
that the disc used to occupy gets changed from being failed to removed
simply to indicate there is no rdev associated with that slot.  That is
the information we record.
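To make the bookkeeping concrete, here is a minimal sketch of the
semantics described above.  This is NOT the md kernel code; the class
and method names are invented for illustration, and it only models the
counts and the slot transition (failed -> removed means "no rdev
associated with this slot any more"):

```python
# Hypothetical model of the md superblock bookkeeping -- invented names,
# not the actual kernel implementation.

WORKING, FAILED, REMOVED = "working", "failed", "removed"

class MdArray:
    def __init__(self, n_slots):
        # Every slot starts with a working rdev attached.
        self.slots = [WORKING] * n_slots
        self.active = n_slots   # counts recorded in the superblock
        self.failed = 0

    def fail(self, slot):
        # A failure (real, or via the set-faulty ioctl) marks the slot
        # failed and moves it from the active to the failed count.
        if self.slots[slot] == WORKING:
            self.slots[slot] = FAILED
            self.active -= 1
            self.failed += 1

    def remove(self, slot):
        # Hot remove: drop the rdev.  REMOVED here only means "no rdev
        # is associated with this slot"; it is not a persistent state
        # stored on the removed drive.
        if self.slots[slot] == FAILED:
            self.slots[slot] = REMOVED
            self.failed -= 1   # removal decrements the failed count

md = MdArray(4)
md.fail(2)
assert (md.active, md.failed) == (3, 1)
md.remove(2)
assert (md.active, md.failed) == (3, 0)
assert md.slots[2] == REMOVED
```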

>  Why bother recording
> that bit in the superblock if you are just going to ignore it?  It is
> there, so it should not be ignored.

You are making assumptions about the use of that bit.  We don't ignore
it, we just don't use it the way you think we should.

>> mdadm /dev/md0 -f /dev/sdc1 -r /dev/sdc1; mdadm --zero-superblock /dev/sdc1
>>
>> and that should be sufficient to satisfy your needs.  If we race between
>> the remove and the zero-superblock with something like a power failure
>> then obviously so little will have changed that you can simply repeat
>> the procedure until you successfully complete it without a power failure.
> 
> Except that now I can't manually re-add the disk to the array.

Sure you can: mdadm /dev/md0 -a /dev/sdc1

>> This is intentional.  A remove event merely triggers a kernel error
>> cycle on the target device.  We don't differentiate between a user
>> initiated remove and one that's the result of catastrophic disc failure.
> 
> Why not?  Seems to be because of a broken definition of the removed state.

Not at all.  You make lots of assumptions, some of them rather silly.
We don't differentiate between the two because the best way to ensure
you don't have errors in your device failure code path is to actually
use it for more than just actual device failures.  It was an
intentional design decision, when the set faulty and hot remove ioctl
calls were added to the raid stack, that they would call into the low
level command handling routines and activate the actual error paths that
would be followed if a command came back from a device with some sort of
error.  By doing that, we are actually testing the command failure path
every time we instigate a hot fail using mdadm.  That was intentional,
and it had nothing to do with the definition of removed state.
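The design point can be sketched like this (again with invented names,
not the real kernel functions): both a genuine command failure and a
user-initiated set-faulty funnel through a single error path, so that
path gets exercised every time someone runs mdadm -f:

```python
# Illustrative sketch of the shared error path -- hypothetical function
# names, not the md driver's actual routines.

calls = []

def handle_device_error(dev):
    # The one low-level error path: mark the device faulty, update the
    # superblock counts, and stop directing writes at it.
    calls.append(dev)

def io_completion(dev, ok):
    # Real command completion from the hardware.
    if not ok:
        handle_device_error(dev)

def set_faulty_ioctl(dev):
    # mdadm -f deliberately takes the same path as a real failure.
    handle_device_error(dev)

io_completion("sdb1", ok=False)   # genuine disc failure
set_faulty_ioctl("sdc1")          # user-initiated hot fail
assert calls == ["sdb1", "sdc1"]  # identical handling either way
```

Routing the ioctl through the failure path means the rarely-hit error
handling code is tested on every administrative hot fail, not only when
hardware actually dies.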

>>  However, trying to access a dead disc causes all sorts of bad behavior
>> on a real running system with a real disc failure, so once we know a
>> disc is bad and we are kicking it from the array, we only try to write
>> that data to the good discs so we aren't hosing the system.
> 
> That is what failed is for.  Once failed, you can't write to the disk
> anymore.  If it isn't failed, you should be able to remove it, and the
> superblock should be updated prior to detaching.

That's all fine and dandy for the MD raid stack, version 2.  Feel free
to start writing said stack with your preferred semantic definitions in
place.

>> We are specifically going in the opposite direction here.  We *want* to
>> automatically readd removed devices because we are implementing
>> automatic removal on hot unplug, which means we want automatic addition
>> on hot plug.
> 
> Again, this is because you are conflating removed, and failed.

No, because we only have one failed state: failed.  Removed as far as
mdadm is concerned is an action, and removed as far as the kernel is
concerned just means "I don't have any devices opened for this slot".
It's not a valid state for a device.

>  If you
> treat them separately then you don't have this problem.  Failed happens
> automatically, so can be undone automatically, removed does not.
> 
>> Again, we are back to the fact that you are interpreting removed to be
>> something it isn't.  We can argue about this all day long, but that
>> option has had a specific meaning for long enough, and has been around
>> long enough, that it can't be changed now without breaking all sorts of
>> back compatibility.
> 
> This may be true for the --remove switch, but as it stands right now,
> the removed flag in the superblock is utterly meaningless.  If you want
> --remove to continue to behave the way it does, then you could add a
> --really-remove command to set the removed flag, though this seems a
> little silly.

I'm sure Neil will take patches.

-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband
