On Feb 27, 2007, at 00:11, Nicolas Pitre wrote:
>> BTW, here are a few issues with the current pack file format:
>>
>> - The final SHA1 is computed over the count of objects in the file
>>   and all compressed data. Why? This is horrible for streaming
>>   applications, where you only know the count of objects at the
>>   end; then you need to access *all* data to compute the SHA-1.
>>   Much better to just compute a SHA1 over the SHA1s of each
>>   object. That way the streamed data can at least go straight to
>>   disk. Buffering one SHA1 per object is probably going to be OK.
> We always know the number of objects before actually constructing or
> streaming a pack. Finding the best delta matches requires that we
> sort the object list by type, but for good locality we need to
> re-sort that list by recency. So we always know the number of
> objects before starting to write, since we need to have the list of
> objects in memory anyway.
When I import a large code base (such as a *.tar.gz), I don't know
beforehand how many objects I'm going to create. Ideally, I'd like
to stream them directly into a new pack without ever having to write
the expanded source to the filesystem.

Also, the receiving end of a streamed pack wants to know the number
of objects first, if only to provide the user with some progress
report.
>> - The object count is implicit in the SHA1 of all objects and the
>>   objects we find in the file. Why do we need it in the first
>>   place? Better to recompute it when necessary. This makes true
>>   streaming possible.
> Sorry, I don't follow you here.
The object count at the beginning of the pack is a little strange for
local on-disk pack files, as it is data that can easily be derived.
The *index* would seem to be the proper place for this.

Also, it is not possible to write a dummy 0 for the count and then
fill in the correct count at the end, because the final SHA1 at the
end of the pack file is a checksum over the count followed by all the
pack data.
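To illustrate why that is (a minimal Python sketch, not git's actual
code): the trailer checksum covers the 12-byte header, which contains
the count, plus every byte of object data, so changing the count
invalidates the whole checksum.

```python
import hashlib
import struct

def pack_trailer(object_count, object_chunks):
    """Compute a pack-file trailer the way the current format does:
    SHA-1 over the header (signature, version, object count) followed
    by all the compressed object data."""
    h = hashlib.sha1()
    # 12-byte header: "PACK" signature, version 2, big-endian count
    h.update(b"PACK" + struct.pack(">II", 2, object_count))
    for chunk in object_chunks:
        h.update(chunk)
    return h.digest()

# Because the count sits under the checksum, patching the count after
# streaming forces a full re-read of the data:
assert pack_trailer(1, [b"obj"]) != pack_trailer(2, [b"obj"])
```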
So for creating a large pack from a stream of data, you have to do
the following:

1. write out a temporary pack file to disk without the correct count
2. fix up the count
3. read the entire temporary pack file to compute the final SHA-1
4. fix up the SHA1 at the end of the file
5. construct and write out the index
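Steps 2 through 4 can be sketched as follows (a hypothetical
illustration in Python, not git code; it assumes step 1 has already
written the pack with a dummy count and no trailer):

```python
import hashlib
import struct

def finish_streamed_pack(path, object_count):
    """Steps 2-4 above: patch the object count in the header, then
    re-read the whole file to recompute and append the trailer SHA-1."""
    with open(path, "r+b") as f:
        # step 2: fix up the count (bytes 8-11 of the 12-byte header)
        f.seek(8)
        f.write(struct.pack(">I", object_count))
        # step 3: read the entire file back to compute the final SHA-1
        f.seek(0)
        h = hashlib.sha1()
        while chunk := f.read(65536):
            h.update(chunk)
        # step 4: append the trailer at the end of the file
        f.write(h.digest())
```

Note that step 3 is the painful part: the re-read is proportional to
the size of the whole pack, purely because the count lives under the
checksum.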
There are a few ways to fix this:

- Have a count of 0xffffffff mean: look in the index for the count.
  Pulling/pushing would still use regular counted pack files.
- Have the pack file checksum be the SHA1 of (the count followed by
  the SHA1 of the compressed data of each object). This would allow
  step 3 to be done without reading back all the data.
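The second proposal lets the trailer be computed from state that fits
in memory while the object data streams straight to disk. A rough
sketch of the idea (hypothetical, not a git patch):

```python
import hashlib
import struct

class StreamingTrailer:
    """Proposed trailer: SHA1 over (count, then the SHA1 of each
    object's compressed data). Only 20 bytes per object need to be
    buffered; the object data itself never has to be re-read."""

    def __init__(self):
        self.object_shas = []

    def add_object(self, compressed_bytes):
        # Hash each object as it streams past; discard the data after.
        self.object_shas.append(hashlib.sha1(compressed_bytes).digest())

    def digest(self):
        # The count is only needed here, at the very end.
        h = hashlib.sha1()
        h.update(struct.pack(">I", len(self.object_shas)))
        for sha in self.object_shas:
            h.update(sha)
        return h.digest()
```

The trailer still commits to the count, the order, and the content of
every object, but fixing it up at the end touches only the buffered
per-object SHA1s.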
-Geert