Reading -less- than the entire file is a required feature of the S3
API: the Range HTTP header may be supplied with the GET method,
specifying the byte range for the request. This corrects an otherwise
obvious limitation in the protocol: if you desire only a 4k chunk of a
2GB file, you should not be forced to download the entire 2GB.
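As a concrete illustration (bucket and object names invented), fetching
the first 4k of a 2GB object per the HTTP/1.1 range-request rules looks
like:

```
GET /mybucket/bigfile HTTP/1.1
Host: mybucket.s3.amazonaws.com
Range: bytes=0-4095

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-4095/2147483648
Content-Length: 4096
```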
Partial-GET is also a must-have feature for my other two hacking
projects, itd and nfs4d. When executing a SCSI READ, itd does not want
to download a huge amount of data just to satisfy a 4-LBA request.
Similarly with nfs4d, executing a READ of an NFS file should not require
nfs4d to download more data than required from chunkd.
For tabled, the implementation requires a bit of modification to the
event-driven GET code path, but nothing overly burdensome. It largely
relies on chunkd, though, to provide the ability to retrieve only a
portion of the specified object.
For chunkd, the implementation of partial-GET is also relatively
straightforward, but it introduces a few minor protocol issues.
Presently, we checksum the entire object at PUT time, and return that
checksum at GET time, so that the client may verify the [strong]
checksum to ensure no data corruption occurred.
A partial-GET implies the stored checksum is useless for the request,
and a new checksum must be computed over just the object subset being
requested. Unfortunately, this also implies that a key optimization,
checksum offload (which moves data straight from kernel pages to NIC
TCP output via DMA, entirely in hardware), becomes impossible.
On an unencrypted GET, chunkd executes sendfile(2), thereby eliminating
several memory copies that would otherwise be made by the app and by the
kernel. sendfile(2) automatically reads data from an fd, and writes
that data to another fd, all without ever exposing that data directly to
the app. As such, partial-GET with checksumming would require replacing
sendfile(out_fd, in_fd, &offset, bytes);
with
while (buffer not completely written to out_fd)
    read(in_fd, buf, count)
    SHA1_hash(buf)
    write(out_fd, buf, count)
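A minimal C sketch of that loop, assuming plain file descriptors. A
trivial additive checksum stands in for SHA-1 here, purely to keep the
example self-contained; in chunkd the real hash update (SHA1_Update()
or similar) would go where the toy checksum is accumulated:

```c
#include <stdint.h>
#include <unistd.h>

/* Copy 'count' bytes from in_fd to out_fd, hashing as we go.
 * The additive checksum is a stand-in for SHA-1; a real
 * implementation would feed each buffer to SHA1_Update() instead.
 * Returns 0 on success, -1 on I/O error or short read.
 */
static int copy_and_hash(int in_fd, int out_fd, size_t count,
                         uint32_t *csum_out)
{
	unsigned char buf[65536];
	uint32_t csum = 0;

	while (count > 0) {
		size_t want = count < sizeof(buf) ? count : sizeof(buf);
		ssize_t rrc = read(in_fd, buf, want);
		if (rrc <= 0)
			return -1;	/* error or unexpected EOF */

		for (ssize_t i = 0; i < rrc; i++)
			csum += buf[i];	/* stand-in for SHA1_Update() */

		/* write(2) may be partial; loop until buffer is out */
		for (ssize_t off = 0; off < rrc; ) {
			ssize_t wrc = write(out_fd, buf + off, rrc - off);
			if (wrc < 0)
				return -1;
			off += wrc;
		}
		count -= rrc;
	}

	*csum_out = csum;
	return 0;
}
```

Unlike sendfile(2), every byte now crosses into userspace once, which
is the unavoidable cost of computing a checksum over the requested
range in software.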
The protocol issue is related. If we are to deliver the checksum in the
-header-, the entire partial-GET object data must be read and
checksummed prior to creating the message header; only then can the
message header and object data be sent. Incredibly inefficient. The
time-honored solution is putting the checksum at the end of the data
stream, thereby allowing the checksum to be generated during data
transmission.
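Sketched as a wire layout (the framing shown is invented, for
illustration only), the trailer approach looks like:

```
[ response header: status, data length, no checksum        ]
[ object data, streamed as it is read and hashed ...       ]
[ trailing checksum: SHA-1 of the object bytes just sent   ]
```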
Another issue this raises is checksum verification. Ideally we want a
pre-stored checksum, so that the local node can verify at data
transmission time that what it reads off disk matches what it wrote $N
days ago. Simply checksumming what you write(2) to a TCP connection
does not protect against disk corruption.
One solution is to update the chunkd disk format (again), and introduce
checksums for each fixed-size block, i.e. one checksum for each 64k in
a file.
This would enable chunkd to verify, prior to sending data on a
partial-GET, that the data pulled off disk is not corrupted.
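A sketch of that idea in C (the struct and function names are
hypothetical, and the toy additive checksum again stands in for a
stored SHA-1 digest): at PUT time one checksum per fixed-size block is
computed and stored with the object, so a later partial-GET need only
verify the blocks covering the requested range:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BLK_SZ 65536	/* one checksum per 64k of file data */

/* Hypothetical per-block checksum table, stored with the object. */
struct blk_csum_tbl {
	uint32_t n_blocks;
	uint32_t csum[];	/* one entry per CHUNK_BLK_SZ block */
};

/* Toy additive checksum; a real format would store SHA-1 digests. */
static uint32_t blk_csum(const unsigned char *data, size_t len)
{
	uint32_t c = 0;
	while (len--)
		c += *data++;
	return c;
}

/* Build the table for an in-memory object at PUT time. */
static struct blk_csum_tbl *build_csum_tbl(const unsigned char *obj,
					   size_t obj_len)
{
	uint32_t n = (obj_len + CHUNK_BLK_SZ - 1) / CHUNK_BLK_SZ;
	struct blk_csum_tbl *t = malloc(sizeof(*t) + n * sizeof(uint32_t));
	if (!t)
		return NULL;
	t->n_blocks = n;
	for (uint32_t i = 0; i < n; i++) {
		size_t off = (size_t)i * CHUNK_BLK_SZ;
		size_t len = obj_len - off < CHUNK_BLK_SZ ?
			     obj_len - off : CHUNK_BLK_SZ;
		t->csum[i] = blk_csum(obj + off, len);
	}
	return t;
}

/* At partial-GET time, verify only the blocks covering [off, off+len). */
static int verify_range(const struct blk_csum_tbl *t,
			const unsigned char *obj, size_t obj_len,
			size_t off, size_t len)
{
	uint32_t first = off / CHUNK_BLK_SZ;
	uint32_t last = (off + len - 1) / CHUNK_BLK_SZ;
	for (uint32_t i = first; i <= last && i < t->n_blocks; i++) {
		size_t boff = (size_t)i * CHUNK_BLK_SZ;
		size_t blen = obj_len - boff < CHUNK_BLK_SZ ?
			      obj_len - boff : CHUNK_BLK_SZ;
		if (blk_csum(obj + boff, blen) != t->csum[i])
			return -1;	/* disk corruption detected */
	}
	return 0;
}
```

The verification cost scales with the size of the requested range, not
the size of the object, which is the whole point of the per-block
table.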
Just some food for thought :)
Jeff
--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html