Re: RFC: detection of silent corruption via ATA long sector reads

"Greg Freemyer" <greg.freemyer@xxxxxxxxx> · Fri, 2 Jan 2009 17:41:50 -0500

Thanks Martin, comments interspersed

On Fri, Jan 2, 2009 at 5:04 PM, Martin K. Petersen
<martin.petersen@xxxxxxxxxx> wrote:
>>>>>> "Greg" == Greg Freemyer <greg.freemyer@xxxxxxxxx> writes:
>
<snip>

> The status is:
>
>  - The infrastructure in the kernel is in place as of .27.  Hoping to
>   get MD/DM support in .29 but I'm running late wrt. the merge window.

I haven't seen any MD patches at all.  Will the MD support verify the
CRC on read and trigger a re-read from the other mirror on failure?
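
To make the question concrete, here is a rough sketch of the behaviour
I mean.  It is not based on any MD code: the mirrors are plain Python
dicts standing in for RAID1 members, read_with_retry() is a made-up
name, and zlib.crc32 is only a stand-in for whatever checksum the
integrity infrastructure actually carries.

```python
import zlib


def read_with_retry(mirrors, sector, expected_crc):
    """Return (data, mirror_index) from the first mirror that verifies.

    mirrors: list of dicts mapping sector number -> bytes (hypothetical
             stand-ins for the RAID1 member devices).
    expected_crc: checksum recorded when the sector was written
                  (CRC32 here purely for illustration).
    """
    for index, mirror in enumerate(mirrors):
        data = mirror[sector]
        if zlib.crc32(data) == expected_crc:
            return data, index
        # Checksum mismatch: fall through and try the next mirror.
    raise IOError("sector %d: checksum mismatch on all mirrors" % sector)


good = b"A" * 512
bad = b"A" * 511 + b"B"            # one corrupted byte
mirrors = [{7: bad}, {7: good}]    # first mirror returns bad data
data, used = read_with_retry(mirrors, 7, zlib.crc32(good))
```

In this toy run the first mirror fails verification and the data comes
back from the second one.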

>  - We recently announced an early adopter program for Oracle DB
>   customers.  The ASM component of the database now supports the
>   integrity hooks so we can true end-to-end integrity protection of DB
>   I/O.

Very cool.

>  - btrfs support is work in progress.
>
>  - Other people have expressed interest in adding support to ext4 and
>   XFS.

Nice, but it seems the block layer will capture the vast majority of issues.

> Greg> especially as it relates to ATA devices?
>
> ATA support was put on hold in the T13 committee because the drive
> vendors don't feel like adding a big, intrusive feature to their
> firmware.  I'm still hoping we can eventually get support added to
> nearline class drives but it'll be a while.  Market demand needs to be
> there first.  I.e. the array vendors that use SATA drives will need to
> start asking for it.
>
> We're just, just, just starting to push out FC support.  Then comes SAS.
> And then hopefully ATA.

The LHC (Large Hadron Collider) people put out a white paper on silent
corruption a year or two ago.  They were very concerned that it could
negatively impact their results.  I don't remember the details, or how
they worked around it.

If they are not already part of your integrity team, you might want to
reach out to them.  And I think they bought / are buying huge amounts
of hardware.

>
> Greg> ie.  Do actual ATA hardware devices that support "T13/ATA External
> Greg> Path Protection" exist yet?  Does it require HDD and controller
> Greg> support?  Or just HDD?
>
> Both.  You could emulate some of the DIX features in software (like
> scatterlist interleaving) and then plug in the long commands on the back
> end.  But as Mark said the checksum formats differ between drive
> vendors/models.

The Linux kernel obviously supports a large amount of vendor-specific code.

Maybe the INTEGRITY crc could be calculated on the fly by libata for
at least a few hard drive vendors that have known CRC algorithms used
with the current long sector reads.

i.e. when INTEGRITY is enabled and a supported hard drive is being
read, libata requests the long sector with the proprietary CRC and
verifies the vendor-specific CRC.  If it looks good, the vendor-specific
CRC is replaced by the SCSI spec CRC and the sector / bio is passed up
the line, just as it would be for a supported SCSI device.
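
A minimal sketch of that conversion, under loud assumptions: the real
vendor check bytes are proprietary and differ per model, so
vendor_check() below is a hypothetical placeholder (CRC32 appended as 4
bytes), and long_sector_to_dif() is a name I made up.  The parts I am
confident about are the guard tag CRC (CRC-16, polynomial 0x8BB7,
initial value 0, no reflection) and the 8-byte tuple layout of guard
tag, application tag and reference tag, with the reference tag being
the low 32 bits of the LBA as in Type 1 protection.

```python
import struct
import zlib


def crc16_t10dif(data):
    """CRC-16 with polynomial 0x8BB7, initial value 0, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def vendor_check(data):
    """Hypothetical stand-in for a vendor's long-sector check bytes."""
    return struct.pack(">I", zlib.crc32(data))


def long_sector_to_dif(long_sector, lba, app_tag=0):
    """Verify the vendor check bytes, then emit 512 data + 8 tuple bytes."""
    data, check = long_sector[:512], long_sector[512:]
    if check != vendor_check(data):
        raise IOError("vendor check bytes do not match at LBA %d" % lba)
    guard = crc16_t10dif(data)
    # Tuple layout: 16-bit guard, 16-bit app tag, 32-bit ref tag (big-endian).
    return data + struct.pack(">HHI", guard, app_tag, lba & 0xFFFFFFFF)


sector = bytes(range(256)) * 2
dif = long_sector_to_dif(sector + vendor_check(sector), lba=1234)
```

The result is a 520-byte buffer that the upper layers could treat like
a sector read from a drive formatted with protection information.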

If those drives started selling well, maybe the drive manufacturers
could be persuaded to implement the full end-to-end protocol.

> On SCSI you could conceivably use the block integrity stuff to store an
> LVM/MD checksum when used with devices that expose the application tag.
>
> However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not
> exactly a lot of space.  And only dumb drives are going to make it
> available.  Some RAID controllers are going to keep those 16-bits for
> their own internal use.
>
> The main purpose of the block integrity stuff is to protect in-flight
> I/O.  Persistence is an optional feature and a side-effect.

In-flight is my concern as well.  All of the silent corruption I've
seen and taken the time to troubleshoot was caused by in-flight
errors.  I've seen it be cables, power supply, controller, RAM, and
CPU cache at a minimum.

> So I think it would be much more worthwhile to implement checksumming in
> MD/DM without relying on special hardware.  I did some experiments in
> that department a few years ago when we were investigating how to go
> about fixing some of the data integrity problems in Linux.
>
> I wrote something akin to DIF in software by doing 64 512-byte blocks +
> 512 bytes of checksums.  The disadvantage there is having to do
> read-modify-write for small writes.  I tried several other approaches
> sacrificing both space and locality but performance was still anemic.
>
> The reason DIF is implemented the way it is (with 520 byte sectors: 512
> bytes followed by 8 bytes of checksum) is to prevent the cost of seeking
> to write the protection information elsewhere.  With solid state devices
> that seek penalty doesn't exist so this may become less of an issue
> going forward.
>
> The beauty of checksumming in btrfs is that the checksum is stored in
> the filesystem metadata which is read/written anyway.  So the only
> overhead is in calculating the actual checksum.  That's something
> virtual block devices have a much harder time providing because they
> don't have metadata describing individual blocks.
>
> That doesn't mean it can't be done but it's a lot more work.  I'm
> personally much more interested in adding support for adding a
> retry-other-mirror interface to MD/DM and leave the checksumming to the
> filesystems.

That makes sense as well, but given that most filesystems won't have
inherent INTEGRITY support, the block layer should also be able
to make retry-other-mirror requests of MD / DM.
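
For reference, the software layout Martin describes above (64 512-byte
blocks plus 512 bytes of checksums) can be sketched as an address
mapping.  This is my reading of it, not his code: I am assuming the
checksum sector follows its 64 data sectors, which leaves 8 bytes of
checksum per data sector, and locate() is a made-up name.

```python
SECTOR = 512
DATA_PER_GROUP = 64                      # data sectors per group
GROUP = DATA_PER_GROUP + 1               # plus one checksum sector
CSUM_SIZE = SECTOR // DATA_PER_GROUP     # 8 bytes of checksum per sector


def locate(logical):
    """Map a logical sector to (data sector, checksum sector, byte offset).

    Assumes the checksum sector sits right after the 64 data sectors
    of its group.
    """
    group, slot = divmod(logical, DATA_PER_GROUP)
    base = group * GROUP
    data_sector = base + slot
    csum_sector = base + DATA_PER_GROUP
    csum_offset = slot * CSUM_SIZE
    return data_sector, csum_sector, csum_offset


first = locate(0)
last_in_group = locate(63)
next_group = locate(64)
```

It also shows where the read-modify-write cost comes from: a
one-sector write has to update 8 bytes inside a shared checksum sector,
so that sector must be read, patched and written back.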

> --
> Martin K. Petersen      Oracle Linux Engineering
>

Also, is there any effort to add diagnostic messages at the various tiers?

You describe this as end-to-end protection, but when it fails, it
would be extremely useful to check dmesg or something and be able to
see that a sector came in from the controller fine, but was corrupted
later, so CPU / memory is suspected vs. sector came in bad from the
controller, so suspect a problem in the controller / cable / power
supply area.
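
Roughly the kind of reporting I have in mind, as a toy sketch.  The
tier names, the first_bad_tier() helper and the use of CRC32 are all
illustrative assumptions, not anything the kernel provides today.

```python
import zlib


def first_bad_tier(snapshots, expected_crc):
    """Return the name of the first tier whose copy fails verification.

    snapshots: ordered list of (tier_name, data) pairs, one per point
               at which the sector was checked on its way up the stack.
    Returns None if every tier verified.
    """
    for name, data in snapshots:
        if zlib.crc32(data) != expected_crc:
            return name
    return None


good = b"\x00" * 512
bad = b"\x00" * 511 + b"\x01"
expected = zlib.crc32(good)

# Sector arrived intact from the controller but was corrupted later.
late = first_bad_tier([("controller", good), ("page cache", bad)], expected)

# Sector was already bad when the controller handed it over.
early = first_bad_tier([("controller", bad), ("page cache", bad)], expected)
```

With a log line per tier, the first case would point at CPU / memory
and the second at the controller / cable / power supply side.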

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com