Re: [PATCH] Support 64-bit indexes for pack files.

On Tue, 27 Feb 2007, Geert Bosch wrote:

> 
> On Feb 27, 2007, at 00:11, Nicolas Pitre wrote:
> 
> > > BTW, here are a few issues with the current pack file format:
> > > - The final SHA1 consists of the count of objects in the file
> > >   and all compressed data. Why? This is horrible for streaming
> > >   applications where you only know the count of objects at the
> > >   end, then you need to access *all* data to compute the SHA-1.
> > >   Much better to just compute a SHA1 over the SHA1's of each
> > >   object. That way at least the data can be streamed to
> > >   disk. Buffering one SHA1 per object is probably going to be OK.
> > 
> > We always know the number of objects before actually constructing or
> > streaming a pack.  Finding best delta matches requires that we sort the
> > object list by type, but for good locality we need to re-sort that list
> > by recency.  So we always know the number of objects before starting to
> > write since we need to have the list of objects in memory anyway.
> 
> When I import a large code-base (such as a *.tar.gz), I don't know
> beforehand how many objects I'm going to create. Ideally, I'd like
> to stream them directly into a new pack without ever having to write
> the expanded source to the filesystem.

Have a look at git-fast-import and contrib/fast-import/import-tars.perl.

Regardless, you cannot produce a decent pack without knowing in advance
how many objects you have.  This is why git-fast-import produces a
suboptimal pack that needs to be repacked in the end.

> The object-count at the beginning of the pack is a little strange for
> local on-disk pack files, as it is data that can easily be derived.
> The *index* would seem to be the proper place for this.

The index is _not_ sent over the network protocol by design.  But like I
said, the receiving end wants to know up front how many objects it'll
have to parse.
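
To illustrate: the pack header is a fixed 12 bytes with the object count
last, so the receiver learns the count before any object data arrives.
Here is a rough sketch of reading it (struct and function names are mine,
not Git's actual code):

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>		/* ntohl */

/*
 * Sketch of the 12-byte pack header: "PACK" signature,
 * version, object count, all in network byte order.
 * Names are illustrative, not Git's own.
 */
struct pack_hdr {
	uint32_t signature;	/* "PACK" */
	uint32_t version;
	uint32_t nr_objects;	/* known up front by the sender */
};

static int read_pack_hdr(FILE *f, struct pack_hdr *hdr)
{
	if (fread(hdr, sizeof(*hdr), 1, f) != 1)
		return -1;
	if (ntohl(hdr->signature) != 0x5041434b)	/* "PACK" */
		return -1;
	hdr->version = ntohl(hdr->version);
	hdr->nr_objects = ntohl(hdr->nr_objects);
	return 0;
}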

> Also, it is not possible to write a dummy 0 in the count and then fill in
> the correct count at the end, because the final SHA1 at the end of the pack
> file is a checksum over the count followed by all the pack data.
> So for creating a large pack from a stream of data, you have to do the
> following:
>  1. write out a temporary pack file to disk without correct count
>  2. fix-up the count
>  3. read the entire temporary pack file to compute the final SHA-1
>  4. fix-up the SHA1 at the end of the file
>  5. construct and write out the index

Sure.  That's in fact what index-pack does when it has to fix up a thin
pack.  But did you find any _real_ issue with this?  Resolving deltas and
recomputing the index are far more costly than simply redoing the whole
pack checksum.
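
To make the cost concrete, the fix-up boils down to one 4-byte patch plus
a single sequential re-read to redo the trailing checksum.  A rough
sketch, using OpenSSL's SHA1 and names of my own (the real index-pack
code looks different):

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <openssl/sha.h>

/*
 * Sketch of steps 2-4: patch the object count at offset 8,
 * re-read everything before the trailing 20-byte SHA-1,
 * then rewrite that checksum.  'f' must be open in "rb+"
 * mode; pack_len is the total file size.  Illustrative
 * only -- the real index-pack code differs.
 */
static int fixup_pack(FILE *f, uint32_t nr_objects, long pack_len)
{
	unsigned char buf[8192], sha1[SHA_DIGEST_LENGTH];
	uint32_t count = htonl(nr_objects);
	long todo = pack_len - SHA_DIGEST_LENGTH;
	SHA_CTX ctx;

	/* step 2: fix up the count in the header */
	if (fseek(f, 8, SEEK_SET) || fwrite(&count, 4, 1, f) != 1)
		return -1;

	/* step 3: re-read all data preceding the trailer */
	if (fseek(f, 0, SEEK_SET))
		return -1;
	SHA1_Init(&ctx);
	while (todo > 0) {
		size_t n = fread(buf, 1,
			todo < (long)sizeof(buf) ? todo : sizeof(buf), f);
		if (!n)
			return -1;
		SHA1_Update(&ctx, buf, n);
		todo -= n;
	}
	SHA1_Final(sha1, &ctx);

	/* step 4: overwrite the trailing checksum */
	if (fseek(f, pack_len - SHA_DIGEST_LENGTH, SEEK_SET))
		return -1;
	if (fwrite(sha1, SHA_DIGEST_LENGTH, 1, f) != 1)
		return -1;
	return fflush(f) ? -1 : 0;
}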

> There are a few ways to fix this:
>  - Have a count of 0xffffffff mean: look in the index for the count.
>    Pulling/pushing would still use regular counted pack files.
>  - Have the pack file checksum be the SHA1 of (the count followed
>    by the SHA1 of the compressed data of each object). This would allow
>    step 3 to be done without reading back all data.

Well we _could_ exclude the object count (or even the entire pack
header for that matter) from the pack checksum.  This way a pack
streamed to disk could be seeked back to have its object count
fixed up once it is known, without needing to restart the checksum
all over.  Maybe something to consider for pack v4.  But in practice
I don't see that as a big issue.
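
To illustrate the idea: if the trailing checksum skipped the header, the
writer could keep hashing object data as it streams and patch the count
in at the end without re-reading anything.  A hypothetical sketch (this
is not what any current pack version does):

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <openssl/sha.h>

/*
 * Hypothetical: the checksum covers object data only, so
 * 'ctx' stays valid while the count is patched in after
 * the fact.  Not what any existing pack version does.
 */
static void finish_streamed_pack(FILE *f, SHA_CTX *ctx, uint32_t nr_objects)
{
	unsigned char sha1[SHA_DIGEST_LENGTH];
	uint32_t count = htonl(nr_objects);

	/* the header was written with a dummy count of 0 */
	fseek(f, 8, SEEK_SET);
	fwrite(&count, 4, 1, f);

	/* no re-read needed: the hash never covered the header */
	SHA1_Final(sha1, ctx);
	fseek(f, 0, SEEK_END);
	fwrite(sha1, SHA_DIGEST_LENGTH, 1, f);
	fflush(f);
}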


Nicolas
