Re: [PATCH 1/2] md bitmap bug fixes

ptb@xxxxxxxxxxxxxx (Peter T. Breuer) · Tue, 22 Mar 2005 23:35:02 +0100

Paul Clements <paul.clements@xxxxxxxxxxxx> wrote:
>      system A
>      [raid1]
>      /     \
> [disk]    [nbd] --> system B
> 
> 2) you're writing, say, block 10 to the raid1 when A crashes (block 10 
> is dirty in the bitmap, and you don't know whether it got written to the 
> disk on A or B, neither, or both)

Let me offer an example based on this scenario.  Block 10 is sent to
both, and B's bitmap is dirtied for it, but the data itself never
arrives.  At the same time block 10 is sent to A, and the bitmap is
dirtied for it, the data sent, and (miraculously) the bitmap on A is
cleared for the received data (I don't now why or how - nobody has yet
specified the algorithm with enough precision for me to say).

At this point B's bitmap is dirty for block 10, and A's is not. A has
received the data for block 10, and B has not.

> 3) something (i.e., your cluster framework) notices that A is gone and 
> brings up a new raid1, with an empty bitmap, on system B:

Now, this looks wrong, because to sync A from B we will later need to
copy block 10 from B to A in order to "undo" the extra write already
done on A, and A's bitmap is not marked dirty for block 10, only B's is,
so we cannot zero B's bitmap because that would lose the information
about block 10.

 --

I've been thinking about this in more general terms, and it seems to me
that the algorithms offered (and I say I have not seen enough
detail to be sure) may be in general "insufficiently pessimistic".

That is, they may clear the bitmap too soon (as in the thought
experiment above).  Or they may not dirty the bitmaps soon enough.

I believe that you are aiming for algorithms in which the _combined_
bitmaps are "sufficiently pessimistic", but the individual bitmaps
are not necesarily so.

But it appears to me as though it may not be much trouble to ensure that
_each_ bitmap is sufficiently pessimistic on its own with respect to
clearing.  Just clear _each_ bitmap only when _both_ writes have been
done.

 --

Can this plan fail to be pessimistic enough with respect to dirtying
the bitmaps in the first place?

What if block 10 is sent to A, which is to say the bitmap on A is
dirtied, and the data sent, and received on A. Can B _not_ have its
bitmap dirtied for block 10?

Well, yes, if A dies before sending out the bitmap dirty to B, but after
sending out the bitmap dirty AND the data to A.  That's normally not
possible. We normally surely send out all bitmap dirties before sending
out any data. But can we wait for these to complete before starting on
the data writes? If B times out, we will have to go ahead and dirty A's
bitmap on its own and thereafter always dirty and never clear it.

So this corresponds to A continuing to work after losing contact with B.

Now, if A dies after that, and for some reason we start using B, then B
will need eventually to have its block 10 sent to A when we resync A
from B.

But we never should have switched to B in the first place! B was
expelled from the array.  But A maybe died before saying so to anyone.

Well, plainly A should not have gone on to write anything in the array
after expelling B until it was able to write in its (A's) superblock
that B had been expelled.

Then, later, on recovery with a sync from B to A (even though it is the
wrong direction), A will either say in its sb that B has not been
expelled AND contain no extra writes t be undone from B, or A will say
that B has been expelled, and its bitmap will say which writes have
been done that were not done on B, and we can happily decide to sync
from B, or sync from A.

So it looks like there are indeed several admin foul-ups and crossed
wires which could give us reason to sync in the rong direction, and
then we will want to know what the recipient has in its bitmap. But
we will be able to see that that is the situuation.

In all other cases, it is sufficient to know just the bitmap on the
master.

The particular dubious situation outlined here is

1) A loses contact with B and continues working without B in the array,
  so B is out of date.

2) A dies, and B is recovered, becoming used as the master.

3) When A is recovered, we choose to sync A from B, not B from A.

In that case we need to look at bitmaps both sides. But note that one
bitmap per array (on the "local" side) would suffice in this case. The
array node location shifts during the process outlined, givig two
bitmaps to make use of eventually.

Peter

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html