Hi! "A month of sundays ago Neil Brown wrote:" > Rather than having several bitmaps, have just one. Normally it is > full of zero and isn't touched. I'm having a little trouble interpreting your suggestion, so I'll work through it here, and you can maybe correct my understanding. In the first place, I can say that having ONE bitmap per raid composite device is OK with me and probably is a worthwhile simplification over having ONE per mirror component, since the normal case will be a two component mirror and a mark on the bitmap will indicate unambiguously where the fault is if the mirror is operational at all. Or will it .. while the mirror is resyncing there may be a doubt. But then it's currently subject to some imprecisions at that stage, since a fault from a normal write while a resync is going on will be marked but the mark will be lost when the bitmap is cleared at the end of the resync (this is a bug - I'm not sure how the raid code itself reacts in this situation since there are races here). Nevertheless the principle is sound. The mark on the bitmap need only indicate "something may be wrong", not precisely what is wrong. More precision implies more efficiency at resync time, but it's a tradeoff. > When a write fails, or when a write is sent to some-but-not-all > devices, set the relevant bit in the bitmap. This is what is currently done, I believe, subject to my correct implementation of the code, and correct understanding of the extant source. Except that presently it's done per component instead of per array. I'm not too sure what whole-array structure to attach the bitmap to. Suggestions welcome. > The first time you set a bit, record the current 'event' number with > the bitmap. Let me clarify here that you are talking of a counter that is incremented once per request received on the device, and which is written to the superblock of all the mirror components at the point when the global bitmap is first dirtied after previously having been clean. I'd observe that we dirty the bitmap because a component has just dropped out/failed, and so the register of that counter on the failed component will stay at zero (or whatever it was), since we can/should no longer write to it. It is not quite clear to me if we should write the counter to a device which is removed while up to date. Let's see ... > The interpretation of this bit map is 'only the blocks that are > flagged in this bitmap have changed since event X'. This is fine by me. > On hot-add, read the old superblock. If it looks valid and matches > the current array, and has an event counter of X or more, then ignore > blocks that have [not got] their bits set in the bitmap [when] > reconstructing, otherwise do a full reconstruction. I have trouble understanding the implications here. See items in square brackets also for possible typos that I've "corrected". Normally a disk will drop out with the counter set at zero on the disk component involved, and the inmemory counter actually at X, but it would not have been possible to write the counter to the sb of the component since we find out that we should have written it only after it's dead... So it will come back with its counter still set to zero. So when it comes back its counter will NOT be set at "X or more", so we "do a full reconstruction". This obviously is not the intention. I believe that possibly that we should note the value of the counter X when we last wrote it successfully to all the disks inkernel. This is a "checkpoint". 
We can update the checkpoint on all disks (and in kernel) from time to
time, I think. The invariant is: if the bitmap is dirty, it was dirtied
after the checkpoint was written both in kernel and on disk.

When a disk comes back without a replacement having been used in the
meantime, the checkpoint in its sb will match the checkpoint in kernel,
and we can update only the blocks signalled as dirty in the bitmap. When
a disk comes back after a replacement has been used in the meantime, the
checkpoint in the kernel will have advanced beyond the one on the disk
(umm - we have to advance it by at least one on the first write after a
hot-add, or do we have a problem when a faulty disk is introduced as a
replacement?), and we will know that we have to do a full resync of the
component.

> When we have a full complement of devices again, clear the bitmap and
> the event record.

Not sure about that. It would lead to confusion when a replacement disk
turned up carrying an old checkpoint value. I don't think one can ever
reset the event counter safely, which means it needs a generation
counter too.

> The advantages of this include:
>  - only need one bitmap

Agreed.

>  - don't need the hot_repair concept - what we have is more general.

Not sure.

>  - don't need to update the bitmap (which would have to be a
>    bus-locked operation) on every write.

Hmm .. one only updates the bitmap when there is a faulted disk
component, more or less. And I'm not quite sure what "bus-locked" means
here - do you mean that the ordering with respect to bus operations must
be strictly preserved? Is that necessary? It's not clear to me.

> Disadvantages:
>  - if two devices fail, will resync some blocks on the later one that
>    don't require it.

Doesn't matter. There are other race conditions which are possibly worse
karma.

> As for the other bits about a block device 'fixing itself' - I think
> that belongs in user-space. Have some program monitoring things and
> re-adding the device to the array when it appears to be working again.

I don't agree - programs are simply not reliable enough, and users are
not reliable enough to install and configure them. This can be done in
about 10 lines of kernel code or less, I believe:

  1) notify the new component which array it now belongs to, with a
     generic ioctl issued after a hot-add;

  2) let the component do a hot-add ioctl back through the blkops array
     and our struct when it comes back online, if it feels like it.

Thanks for the comments .. let me know if I've misinterpreted something.

Peter
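P.S. To pin down the checkpoint comparison a little, the hot-add
decision I have in mind comes to roughly this - same invented names as
the sketch above, and the same caveat that it is only an illustration,
not the real md code:

enum resync_kind { RESYNC_NONE, RESYNC_PARTIAL, RESYNC_FULL };

/* Decide what a component that has just been hot-added needs, given
 * the checkpoint value read back from its old superblock. */
static enum resync_kind resync_needed(const struct mirror_log *log,
                                      uint64_t sb_checkpoint)
{
        if (sb_checkpoint == log->checkpoint)
                /* The disk left and came back with no replacement used
                 * in between: only chunks marked in the bitmap can
                 * differ, so sync just those (or nothing at all). */
                return log->dirty ? RESYNC_PARTIAL : RESYNC_NONE;

        /* The in-kernel checkpoint has advanced past the one on the
         * disk (a replacement was used meanwhile, or the sb is simply
         * stale), so the bitmap tells us nothing about this component. */
        return RESYNC_FULL;
}

The generation counter I mention above would presumably become part of
what is compared here, rather than the bare counter value alone.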