Re: Questions about bitrot and RAID 5/6

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 26 Jan 2014 21:07:36 -0700

On Jan 25, 2014, at 10:56 AM, Wilson Jonathan <piercing_male@xxxxxxxxxxx> wrote:

> On Fri, 2014-01-24 at 13:54 -0700, Chris Murphy wrote:
>> On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@xxxxxxxxxx> wrote:
>> 
>> Please define "bits lost event" and cite some reference. Google returns exactly ONE hit on that, which is this thread. If we cannot agree on the units, we aren't talking about the same thing, at all, with a commensurately huge misunderstanding of the problem and thus the solution.
>> 
>> So please do not merely respond to the 2nd paragraph you disagree with. Answer the two questions above that paragraph.
>> 
>> If the spec is "1 URE event in 1E14 bits read" that is "1 bit nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector drive, and hilariously becomes far worse at "1 bit nonrecoverable in 3E9 bits read" for 4096 byte physical sector drives.
>> 
>> A very simple misunderstanding should have a very simple corrective answer rather than hand-waving and giving up.
> 
> As I understand it, it's "1" error (of no determinate size) for every
> 10E14 bits read….

Well, as I understand it, the < symbol is the "less than" sign, so if the rate is errors per bits read, then it's less than 1 error for every 1E14 bits read.
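
For what it's worth, the arithmetic behind the 2.4E10 and 3E9 figures in the quoted paragraph above can be sketched as follows. This is only an illustration under the assumption being argued against there, namely that each URE costs one whole physical sector of bits; the function name is made up for the example.

```python
def bits_read_per_lost_bit(bits_read_per_ure: float, sector_bytes: int) -> float:
    """Bits read per nonrecoverable bit, ASSUMING every URE loses one
    entire physical sector (the interpretation under discussion, not a
    statement of what the drive spec actually means)."""
    lost_bits_per_ure = sector_bytes * 8
    return bits_read_per_ure / lost_bits_per_ure


# Spec figure from the thread: 1 URE per 1E14 bits read.
for sector_bytes in (512, 4096):
    rate = bits_read_per_lost_bit(1e14, sector_bytes)
    print(f"{sector_bytes}-byte sectors: 1 bit lost per {rate:.2e} bits read")
```

Under that assumption the result gets roughly eight times worse going from 512 byte to 4096 byte physical sectors, which is the point of the comparison.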

> The size of sectors would make no difference to the raw amount of data
> read (although it does open an interesting question of what the 10E14
> actually means, does it also include any check summing data, or is it
> purely "data") nor the fact that 1 URE statistically might happen.

It's an interesting question if "bits read" includes non-user data bits, such as the ECC bits. I'm also curious if there's an ATA or SCSI command that instructs the drive to hand over those 512 bytes, such as they are, despite a read error, or if we're just screwed.

> The amount of data corrupted is, I would have thought, variable
> depending on what forms of checksums etc. were used and is indeterminable
> without knowing the exact forms of work done on the raw data, how many
> checksum values there might be for a "block" and so on, to try and
> recover a meaningful, and valid, return... it could be that just 1 bit
> of data was corrupted or it could be that the entire sector's worth of
> data is garbage; it could also be that the 1 URE is in such a place that
> it causes multiple sectors to be invalid…

I'm willing to bet dollars to donuts that every vendor has differences in the effectiveness of their ECC, yet all of them can detect and correct more than merely 1 bit error in 512/4096 bytes, and probably quite a few more than that.

Chris Murphy
