RE: --assume-clean on raid5/6

> -----Original Message-----
> From: Neil Brown [mailto:neilb@xxxxxxx]
> Sent: Sunday, August 08, 2010 4:56 AM
> To: st0ff@xxxxxx
> Cc: stefan.huebner@xxxxxxxxxxxxxxxxxx; Foster, Brian; linux-
> raid@xxxxxxxxxxxxxxx
> Subject: Re: --assume-clean on raid5/6
> 
> On Sat, 07 Aug 2010 14:28:55 +0200
> Stefan /*St0fF*/ Hübner <stefan.huebner@xxxxxxxxxxxxxxxxxx> wrote:
> 
> > Hi Brian,
> >
> > --assume-clean skips over the initial resync, which - if you will create a
> > filesystem after creating the array - is a time-saving idea.  But keep in
> > mind: even if the disks are brand new and contain only zeros, the parity
> > will probably not be all zeros.  So reading from such an array would be a
> > bad idea.
> > But if the next thing you do is create LVM/filesystem etc., then every bit
> > read from the array will have been written before (and is therefore in
> > sync).
> 
> There is an important point that this misses.
> 
> When md updates a block on a RAID5 it will sometimes use a read-modify-write
> cycle, which reads the old data block and the old parity block, subtracts the
> old data from the parity and then adds the new data to the parity.  It then
> writes the new data block and the new parity block.
> 
> If the old parity was correct for the old stripe, then the new parity will be
> correct for the new stripe.  But if the old parity was wrong then the new
> parity will be wrong too.
> 
> So if you use assume-clean then the parity may well be wrong and could remain
> wrong even when you write new data.  If you then lose a device, the data for
> that device will be computed using wrong parity and you will get wrong data -
> hence data corruption.
> 
> So you should only use --assume-clean if you know the array really is
> 'clean'.
> 
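
To make the failure mode above concrete, here is a minimal sketch of the
read-modify-write arithmetic using made-up byte values and plain shell XOR
(an illustration only, not md's actual code path):

  old_data=0x5a        # old contents of the data block being rewritten
  new_data=0x3c        # new contents being written
  true_parity=0xf0     # parity that would be correct for the old stripe
  stale_parity=0x99    # parity actually on disk after --assume-clean (never synced)

  # md's RMW step: new_parity = old_parity XOR old_data XOR new_data
  printf 'parity after RMW from correct parity: %#x\n' $(( true_parity  ^ old_data ^ new_data ))
  printf 'parity after RMW from stale parity:   %#x\n' $(( stale_parity ^ old_data ^ new_data ))

  # The two results differ by exactly (true_parity XOR stale_parity), i.e. the
  # original error survives the update.  A later rebuild XORs the surviving
  # blocks with this parity, so it reconstructs wrong data.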

Thanks for the information, guys. I was actually attempting to test whether this could occur with a high-level sequence similar to the following (a rough command sketch follows the list):

- dd /dev/urandom data to 4 small partitions (~10MB each).
- Create a raid5 with --assume-clean on said partitions.
- Write a small bit of data (32 bytes) to the beginning of the md, capture an image of the md to a file.
- Fail/remove a drive from the md, capture a second md file image.
- cmp the file images to see what changed, and read back the first 32 bytes of data.
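
Roughly, the sequence looks like this (a sketch with placeholder device names,
not the exact commands I ran):

  # Four ~10MB partitions /dev/sdX1..4 (placeholders), array /dev/md0.
  for d in /dev/sdX1 /dev/sdX2 /dev/sdX3 /dev/sdX4; do
      dd if=/dev/urandom of=$d bs=1M count=10           # random data on each member
  done
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --assume-clean \
        /dev/sdX1 /dev/sdX2 /dev/sdX3 /dev/sdX4
  printf 'THIRTY-TWO BYTES OF TEST DATA..!' | dd of=/dev/md0 bs=32 count=1 conv=fsync
  dd if=/dev/md0 of=/tmp/md-before.img                  # image of the healthy array
  mdadm /dev/md0 --fail /dev/sdX2 --remove /dev/sdX2    # degrade the array
  dd if=/dev/md0 of=/tmp/md-after.img                   # image of the degraded array
  cmp -l /tmp/md-before.img /tmp/md-after.img | head    # what changed from the md's view?
  dd if=/dev/md0 bs=32 count=1 2>/dev/null | od -c      # is the 32-byte write intact?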

In this scenario I do observe differences between the two file images, but my data remains intact. I ran this sequence multiple times, each time failing a different drive in the array, and I also tried stopping and restarting the array (with a drop_caches in between) before the drive-failure step. This leads to my question: is there a write test that can reproduce data corruption under this scenario, or is the read-modify-write cycle an optimization that is not triggered deterministically?
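
In the meantime, one way to see whether the parity really is stale, independent
of what reads return, is to run a 'check' pass through sysfs and look at the
mismatch count (this sketch assumes the array is /dev/md0):

  echo check > /sys/block/md0/md/sync_action              # start a parity check
  while grep -q check /sys/block/md0/md/sync_action; do   # wait for it to finish
      sleep 1
  done
  cat /sys/block/md0/md/mismatch_cnt                      # non-zero => parity is stale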

Also out of curiosity, would --assume-clean be safe on a raid5 if the drives were explicitly zeroed beforehand? Thanks again.

Brian

> RAID1/RAID10 cannot suffer from this so --assume-clean is quite safe with
> those array types.
> The current implementation of RAID6 never does read-modify-write so
> --assume-clean is currently safe with RAID6 too.  However I do not promise
> that RAID6 might not change to use read-modify-write cycles in some future
> implementation.  So I would not recommend using --assume-clean on RAID6 just
> to avoid the resync cost.
> 
> NeilBrown
> 
> >
> > Stefan
> >
> > On 06.08.2010 03:19, brian.foster@xxxxxxx wrote:
> > > Hi all,
> > >
> > > I've read in the list archives that use of --assume-clean on raid5
> > > (raid6?) is not safe if the member drives are not in sync, but it's not
> > > clear to me as to why. I can see the content of a written raid5 array
> > > change if I fail a drive out of the array (created w/ --assume-clean),
> > > but data that I write prior to failing a drive remains intact. Perhaps
> > > I'm missing something. Could somebody elaborate on the danger/risk of
> > > using --assume-clean? Thanks in advance.
> > >
> > > Brian
> 



