RE: raid5 write performance

> -----Original Message-----
> From: Mike Hardy [mailto:mhardy@xxxxxxx]
> Sent: Friday, November 18, 2005 11:57 PM
> To: Guy
> Cc: 'Dan Stromberg'; 'Jure Pečar'; linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: raid5 write performance
> 
> 
> 
> Guy wrote:
> 
> > It is not just a parity issue.  If you have a 4 disk RAID 5, you can't be
> > sure which if any have written the stripe.  Maybe the parity was updated,
> > but nothing else.  Maybe the parity and 2 data disks, leaving 1 data disk
> > with old data.
> >
> > Beyond that, md does write caching.  I don't think the file system can tell
> > when a write is truly complete.  I don't recall ever having a Linux system
> > crash, so I am not worried.  But power failures cause the same risk, or
> > maybe more.  I have seen power failures, even with a UPS!
> 
> Good points there Guy - I do like your example. I'll go further with
> crashing too and say that I actually crash outright occasionally.
> Usually when building out new machines where I don't know the proper
> driver tweaks, or failing hardware, but it happens without power loss.
> It's important to get this correct and well understood.
> 
> That said, unless I hear otherwise from someone that works in the code,
> I think md won't report the write as complete to upper layers until it
> actually is. I don't believe it does write-caching, and regardless, if
> it does it must not do it until some durable representation of the data
> is committed to hardware and the parity stays dirty until redundancy is
> committed.
> 
> Building on that, barring hardware write-caching, I think with a
> journalling FS like ext3 and md only reporting the write complete when
> it really is, things won't be trusted at the FS level unless they're
> durably written to hardware.
> 
> I think that's sufficient to prove consistency across crashes.
> 
> For example, even if you crash during an update to a file smaller than a
> stripe, the stripe will be dirty so the bad parity will be discarded and
> the filesystem won't trust the blocks that didn't get reported back as
> written by md. So that file update is lost, but the FS is consistent and
> all the data it can reach is consistent with what it thinks is there.
> 
> So, I continue to believe silent corruption is mythical. I'm still open
> to a good explanation of why it isn't, though.
> 
> -Mike

I will take a stab at an explanation.

Assume a single stripe has data for 2 different files (A and B).  A disk has
failed, the one holding file B's block.  The file system writes a 4K chunk of
data to file A.  The parity gets updated, but not the data.  Or the data gets
updated but not the parity.  The system crashes or power fails.  The system
recovers, but md can't do anything about the parity with a failed disk.  The
filesystem still does its own recovery.  The disk is replaced and added, and
file B's block is reconstructed from one good block and one stale block.  The
parity now matches the data.  But the reconstructed block (file B) is wrong,
and the file using that block has not been changed for years.  So, silent
corruption.  But you could argue it was a double failure.
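
To make the scenario concrete, here is a minimal sketch in Python.  It is not
md code.  It assumes a hypothetical 3-disk stripe (one block of file A, one
block of file B, one parity block) with plain XOR parity, and the block
contents and variable names are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID 5 style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


# Hypothetical stripe contents before anything goes wrong.
old_a = b"AAAA"                      # block belonging to file A
old_b = b"BBBB"                      # block belonging to file B
parity = xor_blocks(old_a, old_b)    # parity block

# The disk holding B's block has failed, so B now exists only
# implicitly, as the XOR of A's block and the parity block.
assert xor_blocks(old_a, parity) == old_b

# The file system writes to file A.  The data block reaches its disk,
# but the crash happens before the parity block is updated.
disk_a = b"aaaa"        # new data for file A, written
disk_parity = parity    # stale, still describes the old A block

# After the failed disk is replaced, B's block is rebuilt from
# whatever is on the surviving disks.
rebuilt_b = xor_blocks(disk_a, disk_parity)

# The stripe is self-consistent again, yet B's block is wrong.
parity_consistent = xor_blocks(disk_a, rebuilt_b) == disk_parity
b_is_corrupt = rebuilt_b != old_b
```

Both flags end up True: the parity check passes after the rebuild, so
nothing looks wrong, but file B's block no longer holds what was written
to it.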

Also, after a power failure, I have seen the system come back with a single
disk failed.  I guess that 1 disk had an expired superblock.  When you add
that disk back, any stripes that were not up to date will be re-constructed
with invalid data.  I don't know if the intent logging will help here or
not.  Most likely, more than 1 disk will have an expired superblock.  If you
force md to assemble the array it resyncs, but I don't know if it recomputes
the parity or picks a disk to pretend was just added.  It shows a single disk
as rebuilding; if that is what it really does, silent corruption could occur.
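
The difference between those two resync strategies can be sketched the same
way.  This says nothing about which one md actually uses.  It only compares
the two possibilities on a hypothetical 3-disk stripe (d0, d1, parity p)
where a write to d0 landed but the parity update did not.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID 5 style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


# Hypothetical stripe before the interrupted write.
d0_old = b"\x01\x02"
d1 = b"\x10\x20"                 # never touched by the write
p_old = xor_blocks(d0_old, d1)

# Interrupted write: d0 was updated, the parity block was not.
on_disk = {"d0": b"\x0f\x0f", "d1": d1, "p": p_old}

# Strategy 1 (assumed): trust the data blocks and recompute parity.
s1 = dict(on_disk)
s1["p"] = xor_blocks(s1["d0"], s1["d1"])

# Strategy 2 (assumed): pretend d1 was just added and rebuild it
# from the other data block and the stale parity.
s2 = dict(on_disk)
s2["d1"] = xor_blocks(s2["d0"], s2["p"])

# Both results pass a parity check...
s1_consistent = xor_blocks(s1["d0"], s1["d1"]) == s1["p"]
s2_consistent = xor_blocks(s2["d0"], s2["d1"]) == s2["p"]

# ...but only strategy 1 leaves the untouched block d1 intact.
s1_keeps_d1 = s1["d1"] == d1
s2_keeps_d1 = s2["d1"] == d1
```

In this sketch both strategies leave a consistent stripe, but rebuilding d1
from stale parity overwrites a block that was never part of the write.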

Guy
