Re: The TCP and UDP checksum algorithm may soon need updating

"John R Levine" <johnl@xxxxxxxxx> · 10 Jun 2020 11:17:40 -0400

On Wed, 10 Jun 2020, Warren Kumari wrote:
Having read the papers that Craig referenced, that's my interpretation.

One of them is about a big physics application which sends multiple
terabytes of data over the net using what looks like a version of
FTP that transfers several files at once.  They send the data as a lot
of 4 gig files. When they started verifying file checksums, they
found about 20% of the received files were corrupted in transit.
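
For illustration, a file-level check of the kind described above can be
sketched as follows: hash the file in chunks on each end and compare the
digests. This is only a minimal sketch, not the tool that project used;
the function name file_digest and the choice of SHA-256 are assumptions.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks.

    Hypothetical helper for illustration; reading in chunks keeps memory
    use flat even for multi-gigabyte files.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The sender would publish file_digest(path) for each file, and the
receiver would recompute it after the transfer and re-fetch any file
whose digest differs.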

I'm assuming you are talking about "Cross-Geography Scientific Data
Transferring Trends and Behavior", which contains (Section 4.1
Checksum, encryption, and reliability, p.12):

No, it's "Transferring a Petabyte in a Day".

https://www.researchgate.net/publication/325405478_Transferring_a_Petabyte_in_a_Day

"As mentioned, we split each 1.2 TiB snapshot into 256 files of 
approximately equal size. We determined that transferring 64 or 128 files 
concurrently, with a total of 128 or 256 TCP streams, yielded the maximum 
transfer rate. We achieved an average disk-to-disk transfer rate of 92.4 
Gb/s (or 1 PiB in 24 hours and 3 minutes): 99.8% of our goal of 1 PiB in 
24 hours, when the end-to-end verification of data integrity in Globus is 
disabled. In contrast, when the end-to-end verification of data integrity 
in Globus is enabled, we achieved an average transfer rate of only 72 Gb/s 
(or 1 PiB in 30 hours and 52 minutes).

The Globus approach to checksum verification is motivated by the 
observations that the 16-bit TCP checksum is inadequate for detecting data 
corruption during communication [16, 17] and that corruption can occur 
during file system operations [18]. Globus pipelines the transfer and 
checksum computation; that is, the checksum computation of the ith file 
happens in parallel with the transfer of the (i + 1)th file. Data are read 
twice at the source storage system (once for transfer and once for 
checksum) and written once (for transfer) and read once (for checksum) at 
the destination storage system. Therefore, in order to achieve the desired 
rate of 93 Gb/s for checksum-enabled transfers, in the absence of checksum 
failures, 186 Gb/s of read bandwidth from the source storage system and 93 
Gb/s write bandwidth and 93 Gb/s of read bandwidth concurrently from the 
destination storage system are required. If checksum verification failures 
occur (i.e., one or more files are corrupted during the transfer), even 
more storage I/O bandwidth, CPU resources, and network bandwidth are 
required in order to achieve the desired rate."
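
To make the "16-bit TCP checksum is inadequate" point concrete, here is
a rough sketch of the one's-complement sum in the style of RFC 1071,
applied to a bare payload. It is not taken from the paper, and a real
TCP checksum also covers the TCP header and a pseudo-header; the sample
byte strings are made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Made-up sample payloads (assumptions, not real traffic).
original = b"\x12\x34\x56\x78\x9a\xbc"
swapped = b"\x56\x78\x12\x34\x9a\xbc"  # two 16-bit words reordered
offset = b"\x12\x35\x56\x77\x9a\xbc"   # +1 in one word, -1 in another
flipped = b"\x12\x34\x56\x78\x9a\xbd"  # a single bit changed

# Addition is commutative, so reordered words and compensating errors
# leave the sum unchanged and go undetected.
print(internet_checksum(original) == internet_checksum(swapped))
print(internet_checksum(original) == internet_checksum(offset))
# A lone bit flip does change the sum and is caught.
print(internet_checksum(original) == internet_checksum(flipped))
```

With only 16 bits, a random corruption also has roughly a 1 in 65536
chance of matching by accident, which starts to matter at these data
volumes.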

Globus is a file transfer service from U of Chicago:

https://www.globus.org/data-transfer

Regards,
John Levine, johnl@xxxxxxxxx, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly