On Sat, Jun 6, 2020 at 11:47 PM Masataka Ohta
<mohta@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Craig Partridge wrote:
>
> > OK, on to what people are seeing today. This shows that 1 in every
> > 121 file transfers FTP delivers a file that, when you do the md5 sum,
> > turns out not to match the original (note there are multiple possible
> > reasons, but TCP checksum is a strong candidate).
>
> That's unreasonable because most errors are detected by datalink
> layer checksum and almost all remaining errors are detected by
> transport layer checksum, which should have been the reason
> why transport checksum need not be so strong.

I was trying to avoid this thread, but...

I'm somewhat surprised by some of the numbers (like the 1 in every
121 file transfers). As a quick check, I looked on a personal
webserver (connections from random people on the Internets): it has
received 201.1TB and 207,183,907,570 (207 billion) packets. Netstat
shows a total of 1029 (detected[0]) CRC errors, or around one every
~200M packets.

I *think* (but may be completely wrong!) that the chance of a 16-bit
checksum giving a false negative is 1 in 2^16[1], so 200M * 2^16
gives one undetected error in around every 13 trillion packets. My
average packet size is ~970 bytes, so that is ~one bad packet in
~12PB.
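Showing my work, in case anyone wants to poke holes in the arithmetic
-- a quick back-of-the-envelope in Python, using the (worst-case, per
[1]) assumption that 1 in 2^16 corrupted packets sneaks past the
checksum:

    total_bytes   = 201.1e12          # ~201.1 TB received
    total_packets = 207_183_907_570   # packets received
    crc_errors    = 1030              # detected CRC errors, bumped per [0]

    pkts_per_detected = total_packets / crc_errors
    print(pkts_per_detected)          # ~2.0e8 -> one per ~200M packets

    # Worst case: 1 in 2^16 corrupted packets passes the 16-bit checksum.
    pkts_per_undetected = pkts_per_detected * 2**16
    print(pkts_per_undetected)        # ~1.3e13 -> one per ~13 trillion

    avg_pkt = total_bytes / total_packets
    print(avg_pkt)                    # ~971 bytes per packet

    print(pkts_per_undetected * avg_pkt / 1e15)
                                      # ~12.8 -> one bad packet per ~13 PB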
I believe that L2 quality and checksums have improved sufficiently to
make up for the increase in bandwidth and data volumes -- I'm sure we
all used to watch for, and expect, errors on PSTN modems, 100Meg
Ethernet, 56k leased lines and the like. I still graph FCS/CRC errors
on router interfaces, but they are basically just empty graphs these
days...

Are others seeing much, much worse numbers from looking at their
counters?

W

[0]: Yup, it's possible that there were some number of undetected
ones (the whole debate in this thread), but assuming that there are
no systemic issues in the checksum algorithm, I believe the chance of
an error occurring and the checksum still happening to match is 1 in
2^(length of checksum) per corrupted packet -- so across the 1029
detected errors there is a ~1.6% chance (1029/2^16) of one having
slipped through. I bumped the count up to 1030 anyway.

[1]: Actually, I think I overestimate the chance of this happening --
for a corrupted packet to pass, either (at least) two corruptions
have to land in the same packet (one in the data, plus one that turns
the checksum field into the correct value for the corrupted data), or
a single corruption has to happen to leave the checksum calculation
unchanged. I don't know how to easily account for that, so I'll just
use the worst-case estimate.

> > Anecdotally, folks are reporting some middlebox vendors are not
> > updating the TCP checksum but rather letting the outbound interface
> > simply recompute the entire checksum -- which means that if the TCP
> > segment gets damaged during middlebox handling, the middlebox will
> > slap a valid checksum on bad data.
>
> That should be the real problem to make transport checksum not
> to work end to end.
>
> Thus, your proposal to have stronger checksum can not prevent
> file corruptions.
>
> So, we should make middlebox vendors to update checksum incrementally
> or, check the original checksum just before sending a packet
> with the original header (not applicable if payload is also modified).

(A sketch of the incremental-update trick is in the P.S. below.)

> Masataka Ohta
>
> PS
>
> This is a old problem documented in the original paper on
> the E2E principle.
>
> https://dl.acm.org/doi/pdf/10.1145/357401.357402
>
>     2.2 A Too-Real Example
>
>     One gateway computer developed a transient error: while copying
>     data from an input to an output buffer a byte pair was
>     interchanged, with a frequency of about one such interchange in
>     every million bytes passed.

--
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later
expressing regret at having chosen those particular rabid weasels and
that pair of pants.
   ---maf
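P.S. On the "update checksum incrementally" point above: that's the
standard RFC 1624 method (HC' = ~(~HC + ~m + m')), and it really is
only a few lines. A toy sketch, assuming even-length data and
ignoring the +0/-0 corner cases RFC 1624 spends most of its words on
(function names are mine):

    import os

    def csum16(data):
        """Plain 16-bit ones'-complement Internet checksum (RFC 1071).
        Assumes len(data) is even, for brevity."""
        s = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
        while s > 0xFFFF:                 # fold carries back into 16 bits
            s = (s & 0xFFFF) + (s >> 16)
        return ~s & 0xFFFF

    def csum16_update(hc, old_word, new_word):
        """Update checksum hc when one 16-bit word of the covered data
        changes from old_word to new_word (RFC 1624's
        HC' = ~(~HC + ~m + m'))."""
        s = (~hc & 0xFFFF) + (~old_word & 0xFFFF) + new_word
        while s > 0xFFFF:                 # same end-around-carry fold
            s = (s & 0xFFFF) + (s >> 16)
        return ~s & 0xFFFF

    # Quick check: rewrite one word of a fake packet both ways.
    pkt = bytearray(os.urandom(40))
    hc = csum16(pkt)
    old = int.from_bytes(pkt[10:12], "big")
    pkt[10:12] = (0xBEEF).to_bytes(2, "big")
    assert csum16_update(hc, old, 0xBEEF) == csum16(pkt)

The point being that the incremental update preserves whatever
end-to-end story the original checksum told, while a full recompute
on the way out blesses whatever happens to be sitting in the buffer,
corrupted or not.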