On Feb 27, 2007, at 00:11, Nicolas Pitre wrote:
>> BTW, here are a few issues with the current pack file format:
>>
>> - The final SHA1 is computed over the count of objects in the file
>>   and all compressed data. Why? This is horrible for streaming
>>   applications, where you only know the count of objects at the
>>   end; then you need to access *all* data to compute the SHA-1.
>>   Much better to just compute a SHA1 over the SHA1s of each
>>   object. That way the streamed data can at least go straight to
>>   disk. Buffering one SHA1 per object is probably going to be OK.
> We always know the number of objects before actually constructing or
> streaming a pack. Finding the best delta matches requires that we
> sort the object list by type, but for good locality we need to
> re-sort that list by recency. So we always know the number of
> objects before starting to write, since we need to have the list of
> objects in memory anyway.
When I import a large code base (such as a *.tar.gz), I don't know
beforehand how many objects I'm going to create. Ideally, I'd like
to stream them directly into a new pack without ever having to write
the expanded source to the filesystem.

Also, the receiving end of a streamed pack wants to know the number
of objects first, if only to provide the user with some progress
report.
>> - The object count is implicit in the SHA1 of all objects and the
>>   objects we find in the file. Why do we need it in the first
>>   place? Better to recompute it when necessary. This makes true
>>   streaming possible.
> Sorry, I don't follow you here.
The object count at the beginning of the pack is a little strange for
local on-disk pack files, as it is data that can easily be derived.
The *index* would seem to be the proper place for this.

Also, it is not possible to write a dummy 0 for the count and then
fill in the correct count at the end, because the final SHA1 at the
end of the pack file is a checksum over the count followed by all the
pack data.
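To illustrate why that is (a minimal Python sketch, not git's actual
code): the trailer checksum covers the 12-byte header, which contains
the count, plus every byte of object data, so changing the count
invalidates the whole checksum.

```python
import hashlib
import struct

def pack_trailer(object_count, object_chunks):
    """Compute a pack-file trailer the way the current format does:
    SHA-1 over the header (signature, version, object count) followed
    by all the compressed object data."""
    h = hashlib.sha1()
    # 12-byte header: "PACK" signature, version 2, big-endian count
    h.update(b"PACK" + struct.pack(">II", 2, object_count))
    for chunk in object_chunks:
        h.update(chunk)
    return h.digest()

# Because the count sits under the checksum, patching the count after
# streaming forces a full re-read of the data:
assert pack_trailer(1, [b"obj"]) != pack_trailer(2, [b"obj"])
```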
So for creating a large pack from a stream of data, you have to do
the following:

1. write out a temporary pack file to disk without the correct count
2. fix up the count
3. read the entire temporary pack file to compute the final SHA-1
4. fix up the SHA1 at the end of the file
5. construct and write out the index
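Steps 2 through 4 can be sketched as follows (a hypothetical
illustration in Python, not git code; it assumes step 1 has already
written the pack with a dummy count and no trailer):

```python
import hashlib
import struct

def finish_streamed_pack(path, object_count):
    """Steps 2-4 above: patch the object count in the header, then
    re-read the whole file to recompute and append the trailer SHA-1."""
    with open(path, "r+b") as f:
        # step 2: fix up the count (bytes 8-11 of the 12-byte header)
        f.seek(8)
        f.write(struct.pack(">I", object_count))
        # step 3: read the entire file back to compute the final SHA-1
        f.seek(0)
        h = hashlib.sha1()
        while chunk := f.read(65536):
            h.update(chunk)
        # step 4: append the trailer at the end of the file
        f.write(h.digest())
```

Note that step 3 is the painful part: the re-read is proportional to
the size of the whole pack, purely because the count lives under the
checksum.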
There are a few ways to fix this:

- Have a count of 0xffffffff mean: look in the index for the count.
  Pulling/pushing would still use regular counted pack files.
- Have the pack file checksum be the SHA1 of (the count followed by
  the SHA1 of the compressed data of each object). This would allow
  step 3 to be done without reading back all the data.
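The second proposal lets the trailer be computed from state that fits
in memory while the object data streams straight to disk. A rough
sketch of the idea (hypothetical, not a git patch):

```python
import hashlib
import struct

class StreamingTrailer:
    """Proposed trailer: SHA1 over (count, then the SHA1 of each
    object's compressed data). Only 20 bytes per object need to be
    buffered; the object data itself never has to be re-read."""

    def __init__(self):
        self.object_shas = []

    def add_object(self, compressed_bytes):
        # Hash each object as it streams past; discard the data after.
        self.object_shas.append(hashlib.sha1(compressed_bytes).digest())

    def digest(self):
        # The count is only needed here, at the very end.
        h = hashlib.sha1()
        h.update(struct.pack(">I", len(self.object_shas)))
        for sha in self.object_shas:
            h.update(sha)
        return h.digest()
```

The trailer still commits to the count, the order, and the content of
every object, but fixing it up at the end touches only the buffered
per-object SHA1s.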
-Geert