Re: [PATCH 5/9] raid5: log recovery

Shaohua Li <shli@xxxxxx> · Wed, 5 Aug 2015 14:39:09 -0700

On Wed, Aug 05, 2015 at 02:05:25PM +1000, NeilBrown wrote:
> On Wed, 29 Jul 2015 17:38:45 -0700 Shaohua Li <shli@xxxxxx> wrote:
> 
> > This is the log recovery support. The process is quite straightforward.
> > We scan the log and read all valid meta/data/parity into memory. If a
> > stripe's data/parity checksum is correct, the stripe will be recovered.
> > Otherwise, it's discarded and we don't scan the log further. The reclaim
> > process guarantees that a stripe which has started to be flushed to the
> > raid disks has complete data/parity and a correct checksum. To recover a
> > stripe, we just copy its data/parity to the corresponding raid disks.
> > 
> > The tricky thing is the superblock update after recovery. We can't let
> > the superblock point to the last valid meta block. The log might look like:
> > | meta 1| meta 2| meta 3|
> > meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If the
> > superblock points to meta 1, we write a new valid meta 2n.  If a crash
> > happens again, the new recovery will start from meta 1. Since meta 2n is
> > valid, recovery will think meta 3 is valid, which is wrong.  The solution
> > is to create a new meta in meta 2 with its seq == meta 1's seq + 2 and let
> > the superblock point to meta 2.  Recovery will not think meta 3 is a valid
> > meta, because its seq is wrong.
> 
> I like the idea of using a slightly larger 'seq' to avoid collisions -
> except that I would probably feel safer with a much larger seq. Maybe add
> 1024 or something (at least 10).

OK.
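
To make the failure mode concrete, here is a toy model of the scan and
of the seq gap. It is only a sketch in Python, not the kernel code: the
names Meta, scan, finish_recovery and SEQ_GAP are made up for
illustration, the log is a flat list rather than a circular one, and a
boolean stands in for the real data/parity checksum verification.

```python
from dataclasses import dataclass

SEQ_GAP = 1024  # assumption: the larger gap suggested in review (the patch used 2)


@dataclass
class Meta:
    seq: int
    checksum_ok: bool  # stand-in for verifying the real data/parity checksums


def scan(log, start, start_seq):
    """Return the positions of the meta blocks that recovery would replay."""
    replay = []
    expected = start_seq
    for pos in range(start, len(log)):
        meta = log[pos]
        # The first bad checksum or unexpected seq ends the scan.
        if not meta.checksum_ok or meta.seq != expected:
            break
        replay.append(pos)
        expected += 1
    return replay


def finish_recovery(log, replayed, gap=SEQ_GAP):
    """Overwrite the slot after the last replayed meta with a fresh meta.

    Returns the (position, seq) the superblock should record.
    Assumes at least one meta was replayed and that the next slot exists.
    """
    last = replayed[-1]
    new_pos = last + 1
    new_seq = log[last].seq + gap
    log[new_pos] = Meta(new_seq, True)
    return new_pos, new_seq


# | meta 1 | meta 2 | meta 3 | : meta 2 is torn, meta 3 is stale but intact.
log = [Meta(10, True), Meta(11, False), Meta(12, True)]
replayed = scan(log, 0, 10)            # only meta 1 is replayed

# Naive update: superblock stays at meta 1, meta 2n gets seq 11.
naive = list(log)
naive[1] = Meta(11, True)
stale = scan(naive, 0, 10)             # meta 3 is wrongly replayed as well

# With the gap: superblock moves to meta 2, whose seq jumps ahead.
start, seq = finish_recovery(log, replayed)
safe = scan(log, start, seq)           # meta 3 is rejected, its seq is wrong
```

In this model the naive update replays positions 0, 1 and 2, while the
gapped update replays only the new meta at position 1. Any gap of at
least 2 is enough here; a larger one just leaves more margin.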
> > 
> > TODO:
> > -recovery should run the stripe cache state machine in case of disk
> > breakage.
> 
> Why?
> 
> When you write to the log, you write all of the blocks that need
> updating, whether they are destined for a failed device or not.
> 
> When you recover, you then have all the blocks that you might want to
> write.  So write all the ones for which you have working devices, and
> ignore the rest.
> 
> Did I miss something?
> 
> Not that I object, but if it works....

I mean the case where a disk is broken. For example, the log has a stripe
with data for disks 1, 2 and 4. During recovery, disk 2 is broken. Just
writing 1 and 4 isn't good. If we run the state machine, we can read disk 3
and eventually have a consistent stripe.
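
A toy sketch of that example (Python, not kernel code, and it does not
model the stripe cache state machine itself). It assumes single-parity
XOR over four data disks; xor_blocks and recover_degraded are
hypothetical names, and the two-byte blocks are made-up values.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def recover_degraded(logged, on_disk, failed):
    """Work out what to write for one stripe when some disks are broken.

    logged:  {disk: new data found in the log}
    on_disk: {disk: data read back from disks that are not in the log}
    failed:  set of data disks that cannot be written
    Returns (writes, parity): the data blocks to write and the parity
    computed over the whole stripe.
    """
    full = {**on_disk, **logged}  # logged data overrides what is on disk
    parity = xor_blocks(*(full[d] for d in sorted(full)))
    writes = {d: blk for d, blk in logged.items() if d not in failed}
    return writes, parity


# The log has data for disks 1, 2 and 4; disk 2 is broken; disk 3 is read.
logged = {1: b"\x01\x01", 2: b"\x02\x02", 4: b"\x04\x04"}
on_disk = {3: b"\x08\x08"}
writes, parity = recover_degraded(logged, on_disk, failed={2})

# Disk 2's data can later be rebuilt from parity and the surviving disks.
rebuilt = xor_blocks(parity, writes[1], on_disk[3], writes[4])
```

Here only disks 1 and 4 receive data writes, and the parity covers all
four data blocks, so the block for disk 2 is still recoverable after
it is replaced.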

Thanks,
Shaohua