Re: RFC: detection of silent corruption via ATA long sector reads

On 04/01/2009 07:37, Martin K. Petersen wrote:
"John" == John Robinson <john.robinson@xxxxxxxxxxxxxxxx> writes:

John> Excuse me if I'm being dense - and indeed tell me! - but RAID
John> 4/5/6 already suffer from having to do ready-modify-write for
John> small writes, so is there any chance this could be done at
John> relatively little additional expense for these?

You'd still need to store a checksum somewhere else, incurring
additional seek cost.  You could attempt to weasel out of that by adding
the checksum sector after a limited number of blocks and hope that you'd
be able to pull it in or write it out in one sweep.

The downside is that assume we do checksums on - say - 8KB chunks in the
RAID5 case.  We only need to store a few handfuls of bytes of checksum
goo per block.  But we can't address less than a 512 byte sector.  So we
need to either waste the bulk of 1 sector for every 16 to increase the
likelihood of adjacent access.  Or we can push the checksum sector
further out to fill it completely.  That wastes less space but has a
higher chance of causing an extra seek.  Pick your poison.

Well, I was assuming that MD/DM operates in chunk-sized amounts (e.g. 32K, or 64 sectors) anyway. Having a sector or two of checksums on disc immediately following each chunk would then be a pretty small cost, increasing each read or write cycle only marginally (e.g. to 65 sectors), which shouldn't cause much drop in performance (I'd guess roughly 1/64th in throughput and IOPS, if the discs themselves are the bottleneck). Essentially DIF on 32K blocks instead of 512-byte ones; there's a rough sketch of what I mean below. But perhaps this is a bad assumption and MD/DM already optimises away whole-chunk reads and writes where they're not required (for very short, less-than-one-chunk transactions), and I've no idea how often that happens.
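
For concreteness, here is a rough userspace sketch of the layout I
mean. The struct, the choice of CRC-32C and a single 4-byte checksum
covering the whole 32K chunk are all just my illustrative assumptions
(real T10 DIF carries a 16-bit guard CRC per 512-byte sector), so
treat it as a picture rather than a proposal:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define SECTOR_SIZE   512
#define CHUNK_SECTORS 64                       /* 32K data chunk */
#define CHUNK_BYTES   (CHUNK_SECTORS * SECTOR_SIZE)

/* On-disc layout: 64 data sectors immediately followed by one
 * checksum sector, so each chunk occupies 65 sectors instead of 64. */
struct chunk_on_disc {
	uint8_t data[CHUNK_BYTES];
	uint8_t csum_sector[SECTOR_SIZE];      /* CRC stored at offset 0 */
};

/* Bitwise CRC-32C (Castagnoli), reflected form. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int k = 0; k < 8; k++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u
					: (crc >> 1);
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Fill the trailing checksum sector before the chunk is written out. */
static void chunk_seal(struct chunk_on_disc *c)
{
	uint32_t crc = crc32c(c->data, CHUNK_BYTES);

	memset(c->csum_sector, 0, SECTOR_SIZE);
	memcpy(c->csum_sector, &crc, sizeof(crc));
}

/* Re-check the chunk after a whole-chunk (65-sector) read. */
static int chunk_verify(const struct chunk_on_disc *c)
{
	uint32_t stored, actual = crc32c(c->data, CHUNK_BYTES);

	memcpy(&stored, c->csum_sector, sizeof(stored));
	return stored == actual;               /* 1 = good, 0 = corrupt */
}

int main(void)
{
	static struct chunk_on_disc c;         /* static: too big for stack */

	memset(c.data, 0xAB, CHUNK_BYTES);
	chunk_seal(&c);
	printf("clean chunk verifies: %d\n", chunk_verify(&c));

	c.data[12345] ^= 0x01;                 /* simulate silent corruption */
	printf("corrupted chunk verifies: %d\n", chunk_verify(&c));
	return 0;
}

The point is just that the extra sector rides along with the chunk it
protects, so a whole-chunk read or write picks it up for free.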

> The reason I'm advocating checksumming on logical (filesystem) blocks
> is that the filesystems have a much better idea what's good and
> what's bad in a recovery situation.  And the filesystems already have
> an infrastructure for storing metadata like checksums.  The cost of
> accessing that metadata is inherent and inevitable.

Yes, I can see that. But the old premise that RAID tried to maintain was that disc sectors don't go bad. You're quite reasonably dropping the premise rather than trying to do more to maintain it. There might be validity to both approaches.

> We also don't want to do checksumming at every layer.  That's going
> to suck from a performance perspective.  It's better to do
> checksumming high up in the stack and only do it once.  As long as we
> give the upper layers the option of re-driving the I/O.
>
> That involves adding a cookie to each bio that gets filled out by
> DM/MD on completion.  If the filesystem checksum fails we can
> resubmit the I/O and pass along the cookie indicating that we want a
> different copy than the one the cookie represents.

I'd like to understand this mechanism better; at first glance it's either going to be too simplistic and not cover the various block layer cases well, or it means you end up re-implementing RAID and LVM in the filesystem.
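
To make sure I'm picturing the same thing, here is roughly the retry
loop I imagine the filesystem running. Every name in it (io_cookie,
block_read, fs_read_verified) and the toy two-copy stand-ins at the
bottom are entirely mine for illustration, not anything from your
proposal:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical: opaque cookie filled in by MD/DM on completion,
 * identifying which copy (mirror, or parity reconstruction) the
 * lower layer used to satisfy the read. */
struct io_cookie {
	uint64_t copy_id;
	bool     valid;
};

/* Hypothetical lower-layer read: avoid the copy named in 'avoid' if
 * it is valid, record the copy actually used in 'used'. */
int block_read(uint64_t sector, void *buf, size_t len,
	       const struct io_cookie *avoid, struct io_cookie *used);

/* The filesystem's own integrity check over its logical block. */
bool fs_checksum_ok(const void *buf, size_t len);

/* Keep asking for a different copy until either the filesystem
 * checksum passes or we run out of attempts. */
static int fs_read_verified(uint64_t sector, void *buf, size_t len,
			    int max_tries)
{
	struct io_cookie used = { 0 }, avoid = { .valid = false };

	for (int attempt = 0; attempt < max_tries; attempt++) {
		int err = block_read(sector, buf, len, &avoid, &used);

		if (err)
			return err;        /* hard I/O error, give up */
		if (fs_checksum_ok(buf, len))
			return 0;          /* this copy checks out */
		avoid = used;              /* ask for a different one */
	}
	return -1;                         /* every copy looked bad */
}

/* Toy stand-ins so the sketch is self-contained: copy 0 is silently
 * corrupted, copy 1 is good. */

static const char good_copy[] = "GOODDATA";
static const char bad_copy[]  = "BADDDATA";

int block_read(uint64_t sector, void *buf, size_t len,
	       const struct io_cookie *avoid, struct io_cookie *used)
{
	uint64_t copy = (avoid->valid && avoid->copy_id == 0) ? 1 : 0;

	(void)sector;
	memcpy(buf, copy ? good_copy : bad_copy, len);
	used->copy_id = copy;
	used->valid = true;
	return 0;
}

bool fs_checksum_ok(const void *buf, size_t len)
{
	return memcmp(buf, good_copy, len) == 0;   /* toy "checksum" */
}

int main(void)
{
	char buf[8];
	int rc = fs_read_verified(0, buf, sizeof(buf), 2);

	printf("verified read %s\n", rc == 0 ? "succeeded" : "failed");
	return rc;
}

What I can't see yet is how such a cookie stays meaningful once a bio
has been split and remapped through several stacked DM/MD layers.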

Just my €$£0.02 of course.

Cheers,

John.

