Reading -less- than the entire file is a required feature of the S3
API: the Range HTTP header may be supplied with the GET method,
specifying the byte range for the request. This corrects an otherwise
obvious limitation in the protocol: if you desire only a 4k chunk of a
2GB file, you should not be forced to download the entire 2GB.
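As a concrete illustration (bucket and object names invented), fetching
the first 4k of a 2GB object per the HTTP/1.1 range-request rules looks
like:

```
GET /mybucket/bigfile HTTP/1.1
Host: mybucket.s3.amazonaws.com
Range: bytes=0-4095

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-4095/2147483648
Content-Length: 4096
```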
Partial-GET is also a must-have feature for my other two hacking
projects, itd and nfs4d. When executing a SCSI READ, itd does not want
to download a huge amount of data just to satisfy a 4-LBA request.
Similarly with nfs4d, executing a READ of an NFS file should not require
nfs4d to download more data than required from chunkd.
For tabled, the implementation requires a bit of modification to the
event-driven GET code path, but nothing overly burdensome. It largely
relies on chunkd, though, to provide the ability to retrieve only a
portion of the specified object.
For chunkd, the implementation of partial-GET is also relatively
straightforward, but it introduces a few minor protocol issues.
Presently, we checksum the entire object at PUT time, and return that
checksum at GET time, so that the client may verify the [strong]
checksum to ensure no data corruption occurred.
A partial-GET implies the stored checksum is useless for the request,
and a new checksum must be computed over just the object subset being
requested. Unfortunately, this also implies that a key optimization,
checksum offload (which moves data straight from kernel pages to NIC
TCP output via DMA, entirely in hardware), becomes impossible.
On an unencrypted GET, chunkd executes sendfile(2), thereby eliminating
several memory copies that would otherwise be made by the app and by the
kernel. sendfile(2) automatically reads data from an fd, and writes
that data to another fd, all without ever exposing that data directly to
the app. As such, partial-GET with checksumming would require replacing
sendfile(out_fd, in_fd, &offset, bytes);
with
while (buffer not completely written to out_fd)
    read(in_fd, buf, count)
    SHA1_hash(buf)
    write(out_fd, buf, count)
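A minimal C sketch of that loop, assuming plain file descriptors. A
trivial additive checksum stands in for SHA-1 here, purely to keep the
example self-contained; in chunkd the real hash update (SHA1_Update()
or similar) would go where the toy checksum is accumulated:

```c
#include <stdint.h>
#include <unistd.h>

/* Copy 'count' bytes from in_fd to out_fd, hashing as we go.
 * The additive checksum is a stand-in for SHA-1; a real
 * implementation would feed each buffer to SHA1_Update() instead.
 * Returns 0 on success, -1 on I/O error or short read.
 */
static int copy_and_hash(int in_fd, int out_fd, size_t count,
                         uint32_t *csum_out)
{
	unsigned char buf[65536];
	uint32_t csum = 0;

	while (count > 0) {
		size_t want = count < sizeof(buf) ? count : sizeof(buf);
		ssize_t rrc = read(in_fd, buf, want);
		if (rrc <= 0)
			return -1;	/* error or unexpected EOF */

		for (ssize_t i = 0; i < rrc; i++)
			csum += buf[i];	/* stand-in for SHA1_Update() */

		/* write(2) may be partial; loop until buffer is out */
		for (ssize_t off = 0; off < rrc; ) {
			ssize_t wrc = write(out_fd, buf + off, rrc - off);
			if (wrc < 0)
				return -1;
			off += wrc;
		}
		count -= rrc;
	}

	*csum_out = csum;
	return 0;
}
```

Unlike sendfile(2), every byte now crosses into userspace once, which
is the unavoidable cost of computing a checksum over the requested
range in software.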
The protocol issue is related. If we are to deliver the checksum in the
-header-, the entire partial-GET object data must be read and
checksummed prior to creating the message header; only then can the
message header and object data be sent. Incredibly inefficient. The
time-honored solution is putting the checksum at the end of the data
stream, thereby allowing the checksum to be generated during data
transmission.
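Sketched as a wire layout (the framing shown is invented, for
illustration only), the trailer approach looks like:

```
[ response header: status, data length, no checksum        ]
[ object data, streamed as it is read and hashed ...       ]
[ trailing checksum: SHA-1 of the object bytes just sent   ]
```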
Another issue this raises is checksum verification. Ideally we want a
pre-stored checksum, so that the local node can verify at data
transmission time that what it reads off disk matches what it wrote $N
days ago. Simply checksumming what you write(2) to a TCP connection
does not protect against disk corruption.
One solution is to update the chunkd disk format (again), and introduce
checksums for each fixed-size block, i.e. one checksum for each 64k in
a file.
This would enable chunkd to verify, prior to sending data on a
partial-GET, that the data pulled off disk is not corrupted.
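A sketch of that idea in C (the struct and function names are
hypothetical, and the toy additive checksum again stands in for a
stored SHA-1 digest): at PUT time one checksum per fixed-size block is
computed and stored with the object, so a later partial-GET need only
verify the blocks covering the requested range:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BLK_SZ 65536	/* one checksum per 64k of file data */

/* Hypothetical per-block checksum table, stored with the object. */
struct blk_csum_tbl {
	uint32_t n_blocks;
	uint32_t csum[];	/* one entry per CHUNK_BLK_SZ block */
};

/* Toy additive checksum; a real format would store SHA-1 digests. */
static uint32_t blk_csum(const unsigned char *data, size_t len)
{
	uint32_t c = 0;
	while (len--)
		c += *data++;
	return c;
}

/* Build the table for an in-memory object at PUT time. */
static struct blk_csum_tbl *build_csum_tbl(const unsigned char *obj,
					   size_t obj_len)
{
	uint32_t n = (obj_len + CHUNK_BLK_SZ - 1) / CHUNK_BLK_SZ;
	struct blk_csum_tbl *t = malloc(sizeof(*t) + n * sizeof(uint32_t));
	if (!t)
		return NULL;
	t->n_blocks = n;
	for (uint32_t i = 0; i < n; i++) {
		size_t off = (size_t)i * CHUNK_BLK_SZ;
		size_t len = obj_len - off < CHUNK_BLK_SZ ?
			     obj_len - off : CHUNK_BLK_SZ;
		t->csum[i] = blk_csum(obj + off, len);
	}
	return t;
}

/* At partial-GET time, verify only the blocks covering [off, off+len). */
static int verify_range(const struct blk_csum_tbl *t,
			const unsigned char *obj, size_t obj_len,
			size_t off, size_t len)
{
	uint32_t first = off / CHUNK_BLK_SZ;
	uint32_t last = (off + len - 1) / CHUNK_BLK_SZ;
	for (uint32_t i = first; i <= last && i < t->n_blocks; i++) {
		size_t boff = (size_t)i * CHUNK_BLK_SZ;
		size_t blen = obj_len - boff < CHUNK_BLK_SZ ?
			      obj_len - boff : CHUNK_BLK_SZ;
		if (blk_csum(obj + boff, blen) != t->csum[i])
			return -1;	/* disk corruption detected */
	}
	return 0;
}
```

The verification cost scales with the size of the requested range, not
the size of the object, which is the whole point of the per-block
table.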
Just some food for thought :)
Jeff
--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html