>>>>> "Greg" == Greg Freemyer <greg.freemyer@xxxxxxxxx> writes:

Greg> I haven't seen any MD patches at all. Will the MD support verify
Greg> the CRC on read and trigger a RAID re-read of the other mirror on
Greg> failure?

No. With the data integrity model it is the owner of the integrity
metadata that needs to re-drive the I/O in case of failure. So that
means the application, filesystem or the block layer depending on who
added it.

The reason for this is twofold:

 1) The owner of the I/O in question has much better knowledge about
    the context. On a write it can re-run verification checks on its
    buffers before deciding whether to try again, notify the user, etc.

 2) Limiting the number of times we calculate the CRC/checksum. If
    every layer in the I/O stack did a check things would get painfully
    slow. So it's better to bubble everything to the top and do it
    once.

That's why it's important to me to ensure that the appropriate signaling
is in place so that upper layers can influence what's going on below.
I.e. telling MD/DM to retry redundant copies.

That said, adding a belt-and-suspenders option to MD/DM to verify all
I/O would be trivial. But I don't think it's worth it.

Greg> The LHC (Large Hadron Collider) people put out a white paper on
Greg> silent corruption a year or two ago. They were very concerned
Greg> that it could negatively impact their results.

I've been talking to them on and off.

>> Both. You could emulate some of the DIX features in software (like
>> scatterlist interleaving) and then plug in the long commands on the
>> back end. But as Mark said the checksum formats differ between drive
>> vendors/models.

Greg> The Linux kernel obviously supports a large amount of vendor
Greg> specific code.

However, the actual ECC stored by disk drives is proprietary. The drive
vendors have spent years and years refining their algorithms. I think
it's highly unlikely that they'd be willing to tell us what's in there
and how it's calculated.

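To illustrate the re-drive model described earlier in this message (the
owner of the integrity metadata verifies once, at the top, and asks for
another copy on mismatch), here is a minimal user-space sketch in
Python. It is not kernel code: `Mirror` and `read_verified` are
hypothetical names, and `zlib.crc32` is only a stand-in for whatever
checksum the owner actually attached.

```python
import zlib


class Mirror:
    """Hypothetical stand-in for an MD/DM mirror set: one copy per leg."""

    def __init__(self, copies):
        self.copies = list(copies)

    def read(self, leg):
        # The lower layer does no checksum work; it just returns the data.
        return self.copies[leg]


def read_verified(mirror, expected_crc):
    """Verify once at the top of the stack; re-drive the read on mismatch.

    Returns (data, leg) for the first copy that matches the checksum the
    owner stored, or raises IOError if no copy verifies.
    """
    for leg in range(len(mirror.copies)):
        data = mirror.read(leg)
        if zlib.crc32(data) == expected_crc:  # assumption: CRC32 stand-in
            return data, leg
    raise IOError("all redundant copies failed verification")


good = b"payload"
mirror = Mirror([b"pay1oad", good])  # leg 0 is silently corrupted
data, leg = read_verified(mirror, zlib.crc32(good))
```

In this sketch the checksum is computed in exactly one place, and the
mirror layer only needs a way to be told "give me the other copy".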
I really think you should all just go bug your drive vendors about this
feature. The ATA add-on (called External Path Protection) was pretty
much fully baked when it was shelved. It is compatible with the SCSI
ditto so interoperability is a no-brainer. But the drive vendors fought
it vehemently.

Interestingly enough, SSD vendors seem much more interested in adding
competitive features.

Greg> Maybe the INTEGRITY crc could be calculated on the fly by libata
Greg> for at least a few hard drive vendors that have known CRC
Greg> algorithms used with the current long sector reads.

It's usually an ECC and not a CRC, btw. And it's relatively big. It's
not unusual to be able to correct on the order of 50 bytes out of 512.

Greg> ie. When INTEGRITY is enabled and supported hard drives are being
Greg> read from, libata requests the long sector with proprietary CRC
Greg> and verifies the vendor specific CRC. If it looks good, then the
Greg> vendor specific CRC is replaced by the SCSI Spec CRC and the
Greg> sector / bios are passed up the line just like a supported SCSI
Greg> device would do.

Not necessary. The integrity infrastructure is completely agnostic to
the data contained in the protection buffer. It's all done by callbacks
registered with the block device.

And consequently filesystems and applications operate at the "protect
this buffer"/"verify this buffer" level. They don't have to know or
care about T10, CRCs, ATA or anything.

The actual format is negotiated in case of MD/DM that spans devices with
potentially different capabilities/checksum formats. With SCSI we have
the luxury that the CRC is mandatory so we can always fall back to that.

Greg> In-flight is my concern as well. All of the silent corruption
Greg> I've seen and taken the time to troubleshoot was caused by
Greg> in-flight errors. I've seen it be cables, power supply,
Greg> controller, ram, and CPU cache at a minimum.

Yup.

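The callback-based, format-agnostic design mentioned above can be
pictured with a small user-space sketch in Python. This is an
illustration under stated assumptions, not the kernel interface:
`IntegrityProfile`, `make_profile` and `BlockDevice` are hypothetical
names, and CRC32/Adler-32 from `zlib` merely stand in for two devices
with different checksum formats.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class IntegrityProfile:
    """Hypothetical per-device profile: a pair of opaque callbacks."""
    name: str
    generate: Callable[[bytes], bytes]      # "protect this buffer"
    verify: Callable[[bytes, bytes], bool]  # "verify this buffer"


def make_profile(name, checksum):
    """Build a profile from any function mapping bytes to a 32-bit int."""
    def generate(data: bytes) -> bytes:
        return checksum(data).to_bytes(4, "big")

    def verify(data: bytes, tag: bytes) -> bool:
        return generate(data) == tag

    return IntegrityProfile(name, generate, verify)


# Two stand-in formats; the caller never looks inside the tag.
CRC32_PROFILE = make_profile("crc32", zlib.crc32)
ADLER32_PROFILE = make_profile("adler32", zlib.adler32)


class BlockDevice:
    """Toy device that stores (data, protection tag) per sector."""

    def __init__(self, profile: IntegrityProfile):
        self.profile = profile
        self.sectors: Dict[int, Tuple[bytes, bytes]] = {}

    def write(self, lba: int, data: bytes) -> None:
        self.sectors[lba] = (data, self.profile.generate(data))

    def read(self, lba: int) -> bytes:
        data, tag = self.sectors[lba]
        if not self.profile.verify(data, tag):
            raise IOError("integrity check failed at LBA %d" % lba)
        return data


for profile in (CRC32_PROFILE, ADLER32_PROFILE):
    dev = BlockDevice(profile)
    dev.write(0, b"x" * 512)
    dev.read(0)  # same calling code regardless of the checksum format
```

The code that calls `write` and `read` is identical for both profiles,
which is the point: only the registered callbacks know the format.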
Greg> That makes sense as well, but given that most filesystems won't
Greg> have inherent INTEGRITY support, then the block layer should also
Greg> be able to make retry-other-mirror requests of MD / DM.

Well, this is somewhat orthogonal.

A drive is not going to return good sense information if the CRC didn't
match the data. So the I/O is going to fail and DM/MD can retry at
will. In that case it doesn't really matter what caused the failure and
DM/MD will retry regardless.

You could argue that the data could still be corrupted on the way back
from the drive. But I haven't seen that happen much. In any case, the
verification further up the stack is going to catch the mismatch.

Most of the errors I see on READ are due to DMAs that for whatever
reason didn't actually happen. That's actually a fun thing to do:
Poison all pages in the target scatterlist before issuing a READ. I've
had to do that several times to prove that transfers went missing in
action.

Greg> Also is there any effort to add diagnostic messages at the various
Greg> tiers?

Greg> You describe this as end-to-end protection, but when it fails, it
Greg> would be extremely useful to check dmesg or something and be able
Greg> to see that a sector came in from the controller fine, but was
Greg> corrupted later, so CPU / memory is suspected vs. sector came in
Greg> bad from the controller, so suspect a problem in the controller /
Greg> cable / power supply area.

Right now we distinguish between errors caught by the HBA and errors
caught by the target device.

A big problem we're trying to tackle is the case where a write is
acknowledged by the RAID controller and stored in non-volatile memory
there. Once the RAID controller commits the write to an actual disk the
write fails and for some reason the RAID controller doesn't succeed in
writing the block elsewhere.

In that case the original I/O has been completed at the OS level.
There's really no means for the array head to come back and say "Oh, btw.
that I/O that I acked a while ago didn't actually make it". And even if
it did we would have forgotten all about the context of that I/O so it
wouldn't be of much help.

So out of band error reporting like that (that also involves SAN
switches) is a topic for discussion within the SNIA Data Integrity TWG.

-- 
Martin K. Petersen	Oracle Linux Engineering