On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li <shli@xxxxxx> wrote:

> > Your code does avoid write-hole-protection for full-stripe-writes, and
> > this would greatly reduce the number of blocks that were written
> > multiple times.  However I'm not convinced that is correct.
> > A reasonable goal is that if the system crashes while writing to a
> > storage device, then reads should return the old data or the new data,
> > not anything else.  A crash in the middle of a full-stripe-write to a
> > degraded array could result in some block in the stripe appearing to
> > contain data that is different to both the old and the new.  If you are
> > going to close the hole, I think it should be done properly.
>
> I can do it simply.  But I don't think this assumption is true.  If you
> write to a disk range and there is a failure, there is nothing to
> guarantee that you can read either the old data or the new data.

If you write a range of blocks to a normal disk and crash during the write,
each block will contain either the old data or the new data.
If you write a range to a degraded RAID5 and crash during the write, you
cannot make that same guarantee.

I don't know how important this is, but then I don't really know how
important any of this is.

> > A combined log would "simply" involve writing every data block and
> > every computed parity block (with index information) to the log device.
> > Replaying the log would collect data blocks and flush out those in a
> > stripe once the parity block(s) for that stripe became available.
> >
> > I think this would actually turn into a fairly simple logging mechanism.
>
> It's not simple at all.  It's unlikely we write data and parity
> contiguously on disk and at the same time.  This will make log
> checkpointing fairly complex.

I don't see any cause for complexity.  Let me be more explicit.

I imagine that all data remains in the stripe cache, in memory, until it is
finally written to the RAID5.  So the stripe cache will need to be quite a
bit bigger.

Every time we get a block that we want to write, either a new data block or
a computed parity block, we queue it to the log.  The log works like this:

 - take the first (e.g.) 256 blocks in the queue, create a header to
   describe them, write the header with FUA, then write all the data
   blocks.  If there are fewer than 256, just write what we have.
 - when the header write completes, all blocks written *previously* are
   now safe and we can call bio_end_io on data or unlock the stripe for
   parity.
 - loop back and write some more blocks.  If there are no blocks to write,
   write a header which describes an empty set of blocks, and wait for more
   blocks to appear.

Each stripe_head needs to track (roughly) where the relevant blocks were
written so it can release them when the stripe is written.  I would
conceptually divide the log into 32 regions and keep a 32-bit number with
each stripe.  When a block is assigned to a region in the log, the relevant
bit is set for the stripe and a per-region counter is incremented.  When a
stripe completes its write to the array, the counter for each region whose
bit is set is decremented and the stripe's bits are cleared.  The log
cannot progress into a region which has a non-zero counter.

We choose the size of transactions so that the first block of each region
is a header block.  These contain a magic number, a sequence number, and a
checksum, together with the addresses of the data/parity blocks.  On
restart we read all 32 of these to find out where the log starts and ends.
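To make the header and region bookkeeping concrete, here is a rough sketch
in C of one way it could look.  This is purely illustrative; the structure
names, field sizes and on-disk layout are invented for the example, not
taken from any actual md code.

	/* Illustrative sketch only -- names, sizes and layout are invented. */
	#include <stdint.h>

	#define LOG_MAGIC          0x6c6f6735u
	#define LOG_REGIONS        32    /* log conceptually divided into 32 regions */
	#define BLOCKS_PER_HEADER  256   /* at most 256 data/parity blocks per header */

	/* One entry per data or parity block that follows the header in the log. */
	struct log_block_entry {
		uint64_t raid_sector;    /* where the block belongs on the array */
		uint32_t flags;          /* data or parity, target device, etc. */
	};

	/*
	 * Header block, written with FUA.  The first block of every region is
	 * a header, so on restart all 32 of them can be read to find where the
	 * log starts and ends (highest valid sequence number wins).
	 */
	struct log_header {
		uint32_t magic;          /* LOG_MAGIC */
		uint32_t checksum;       /* over the header contents */
		uint64_t seq;            /* monotonically increasing sequence number */
		uint32_t nr_blocks;      /* 0..BLOCKS_PER_HEADER; 0 = empty transaction */
		struct log_block_entry entries[BLOCKS_PER_HEADER];
	};

	/* Per-stripe state: which log regions still hold blocks for this stripe. */
	struct stripe_log_state {
		uint32_t region_bits;    /* bit i set => stripe has blocks in region i */
	};

	/* How many stripes still reference each region of the log. */
	static unsigned int region_pending[LOG_REGIONS];

	/* A block belonging to this stripe has been queued into log region 'r'. */
	static void log_block_queued(struct stripe_log_state *st, unsigned int r)
	{
		if (!(st->region_bits & (1u << r))) {
			st->region_bits |= 1u << r;
			region_pending[r]++;   /* first block of this stripe in r */
		}
	}

	/* The stripe has been written to the RAID5; its log space may be reused. */
	static void stripe_written_to_array(struct stripe_log_state *st)
	{
		for (unsigned int r = 0; r < LOG_REGIONS; r++)
			if (st->region_bits & (1u << r))
				region_pending[r]--;
		st->region_bits = 0;
	}

	/* The log head must not advance into a region that is still referenced. */
	static int region_reusable(unsigned int r)
	{
		return region_pending[r] == 0;
	}

Real code would of course tie this into the stripe_head and the bio
completion paths, handle log wrap-around, and protect the counters with a
lock.  On crash recovery we would read all 32 region headers, keep the ones
with a valid magic and checksum, and order them by sequence number to find
the live part of the log.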
Then we replay all the blocks into the stripe cache, discarding any that
don't come with the required parity blocks.

So it is a very simple log which is never read except on crash recovery.
It commits everything ASAP so that the writeout to the array can be lazy
and can gather related blocks and sort by address etc. with no impact on
filesystem latency.

Does that make sense?

NeilBrown