On Tue, Jun 9, 2020 at 9:08 PM John Levine <johnl@xxxxxxxxx> wrote:
>
> In article <3ac60a21-4aee-d742-bedc-5be3a4e65471@xxxxxxxx>,
> Michael Thomas <mike@xxxxxxxx> wrote:
> >So the long and short of this entire issue seems to be is, is the
> >uncaught error rate serious enough that warrant rethinking weak
> >transport and frankly L2 layer error detection? ...
>
> Having read the papers that Craig referenced, that's my interpretation.
>
> One of them is about a big physics application which sends multiple
> terabytes of data over the net using what looks like a version of
> FTP that transfers several files at once. They send the data as a lot
> of 4 gig files. When they started verifying file checksums, they
> found about 20% of the received files were corrupted in transit.

I'm assuming you are talking about "Cross-Geography Scientific Data
Transferring Trends and Behavior", which contains (Section 4.1
Checksum, encryption, and reliability, p.12):

"We note that if a user changes a file during a transfer, this action
can be reported as an integrity failure. We cannot distinguish this
from an actual failure."

Yes, I'm sure that checksum errors do exist, but from my quick checks
I haven't been seeing anything like the error rates discussed here --
and, as a quick sanity check, 4GB is on the same order as a DVD/many
OS distributions:

RedHat: 8.2.0 - 8GB, 7.9 Beta - 4GB -- https://developers.redhat.com/products/rhel/download
Ubuntu: 18.04 (Desktop): 2GB -- https://releases.ubuntu.com/18.04/
Pop!_OS: 20.04LTS: 2.36 GB (NVIDIA) -- https://pop.system76.com/
Fedora 32: Standard ISO image for x86_64: 1.9GB -- https://getfedora.org/en/server/download/
Kali Linux 64-Bit (Installer): 3.6GB -- https://www.kali.org/downloads/
Linux Mint 19.3 "Tricia" - Cinnamon (64-bit): 1.9GB -- https://www.linuxmint.com/edition.php?id=274
debian-10.4.0-amd64-DVD-3.iso: 4.4GB -- https://cdimage.debian.org/debian-cd/current/amd64/iso-dvd/

I'm assuming that a: almost all of us have downloaded multiple copies
of at least a few of these, and b: we check the hashes on the ISOs we
downloaded. I certainly haven't been seeing anything like 1 in 5, or 1
in 10 ISO downloads with a corrupt hash[0].

I also move significant amounts of data around - perhaps I'm just
blessed, but if I were getting corruption on anything approaching that
level, I'm sure I'd have noticed - 20% errors in 4GB files mean I
should be seeing a corruption once every ~20GB. I regularly move TBs
around (backups, DRBD, large containers, databases, etc) - ssh/scp
will log "2: Packet corrupt" (or "Corrupted MAC on input.
Disconnecting: Packet corrupt" on the server side). I stuff all of my
logs into a combination of Logstash and Loki, and querying this gives
no occurrences of this message:

Loki: {job="syslog",type="server"} |~ "sshd.*Corrupt MAC" == 0

Again, I'm sure that there are checksum errors, but I think that a:
there is lots of data that can be easily looked at to estimate
occurrence (including from CDNs and large scale operators), b: we need
to prioritize what we work on.
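
For anyone who wants to run the same sort of check on their own
systems, here is a minimal Python sketch of the three steps mentioned
above: hashing a downloaded file, working out how many corrupt files a
given failure rate would predict, and counting suspicious sshd log
lines. The function names, the chunk size, and the log pattern are all
illustrative assumptions (the pattern is built from the two messages
quoted above); none of this is the tooling actually used for the
numbers in this mail.

```python
import hashlib
import re


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-GB ISO never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_corrupt_files(total_bytes, file_bytes, corrupt_fraction):
    """Number of corrupt files predicted if corrupt_fraction of files fail."""
    return (total_bytes / file_bytes) * corrupt_fraction


# Assumed pattern: matches the sshd messages quoted in this thread.
# Adjust it to whatever your own sshd version actually logs.
CORRUPTION_PATTERN = re.compile(r"sshd.*(Corrupted MAC|Packet corrupt)")


def count_corruption_lines(lines):
    """Count log lines that look like sshd reporting a corrupted packet."""
    return sum(1 for line in lines if CORRUPTION_PATTERN.search(line))
```

Compare the result of sha256_of_file() with the digest published next
to the ISO. With the figures from the paper,
expected_corrupt_files(20e9, 4e9, 0.2) comes out at about one corrupt
file per 20GB moved, which is the rate used in the estimate above.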

I'd love to see people having a look at their systems and reporting
what sorts of errors they see....

W

[0]: actually I've only once seen the checksum not match, and that was
because of a NAT box which tried to do ALG fixups in the payload and
replaced all occurrences of the external address bit-pattern with the
internal one...

>
> In that application they resend the corrupt files and they obviously
> need to make the files smaller. But retransmitting a file at a time seems
> a lot less efficient than improving the checksums and using the
> existing TCP packet level retransmission.
>
> --
> Regards,
> John Levine, johnl@xxxxxxxxx, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly
>

--
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf