Re: [PATCH] Support 64-bit indexes for pack files.

Geert Bosch <bosch@xxxxxxxxxxx> · Tue, 27 Feb 2007 11:55:48 -0500

On Feb 27, 2007, at 11:11, Shawn O. Pearce wrote:
> Geert Bosch <bosch@xxxxxxxxxxx> wrote:
>> When I import a large code-base (such as a *.tar.gz), I don't know
>> beforehand how many objects I'm going to create. Ideally, I'd like
>> to stream them directly into a new pack without ever having to write
>> the expanded source to the filesystem.
>
> See git-fast-import.  If you are coming from a tar, also see
> contrib/fast-import/import-tars.perl.  :-)

Yes, I saw that, really nice. I had written something myself to
create pack files from a streaming data source. Basically, I
break arbitrary data streams (mostly backups) into chunks
along content-defined boundaries and then link the chunks
together in a tree. This eliminates duplicate files, and even
identical chunks within larger files. I start a new pack file
whenever the current one grows beyond a predefined size
(currently 128 MB).
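
A minimal sketch of content-defined chunking with deduplication, in
Python. This is not the tool described above: the function names, the
polynomial rolling hash, the window size and the boundary mask are all
assumptions chosen for illustration, and the chunks are linked by a
flat list of keys rather than a tree. A real chunker would usually
also enforce minimum and maximum chunk sizes, which are left out here.

```python
import hashlib


def cdc_chunks(data, window=16, mask=0xFF):
    """Split bytes into chunks at content-defined boundaries.

    A boundary is declared after byte i when a rolling hash of the last
    `window` bytes has all of its `mask` bits set. Because the hash depends
    only on those bytes, inserting data earlier in the stream shifts the
    boundaries but does not change where they fall relative to the content.
    """
    base = 257
    mod = 1 << 32
    top = pow(base, window, mod)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod
        if i >= window:
            # Remove the contribution of the byte that slid out of the window.
            h = (h - data[i - window] * top) % mod
        if i >= window - 1 and (h & mask) == mask:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing chunk without a boundary


def store_chunks(data, store):
    """Store each chunk once, keyed by its SHA-1; return the list of keys.

    Hypothetical helper: keys are the plain SHA-1 of the chunk bytes, not
    git object ids.
    """
    keys = []
    for chunk in cdc_chunks(data):
        key = hashlib.sha1(chunk).hexdigest()
        store.setdefault(key, chunk)  # identical chunks are stored only once
        keys.append(key)
    return keys
```

Storing the same stream twice, or a stream that shares long runs with
an earlier one, adds few or no new entries to `store`; the original
data is recovered by concatenating the chunks in key order.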

I'm happy I can base this on git-fast-import now.

>> So for creating a large pack from a stream of data, you have to do
>> the following:
>>   1. write out a temporary pack file to disk without the correct
>>      object count
>>   2. fix up the count
>>   3. read the entire temporary pack file to compute the final SHA-1
>>   4. fix up the SHA-1 at the end of the file
>>   5. construct and write out the index
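
Steps 1 to 4 above can be sketched in Python as follows. This is an
illustration only, not git-fast-import's actual code: the function
names are hypothetical, every object is written as an undeltified
blob, and step 5 (the index) is omitted. The layout it assumes is the
version 2 pack layout: a 12-byte header ("PACK", version, object
count, all big-endian), the object entries, and a trailing 20-byte
SHA-1 of everything before it.

```python
import hashlib
import io
import struct
import zlib

OBJ_BLOB = 3  # pack object type code for a blob


def encode_entry_header(obj_type, size):
    """Encode an object's type and uncompressed size as a pack entry header."""
    byte = (obj_type << 4) | (size & 0x0F)  # type in bits 6-4, low 4 size bits
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def write_pack_from_stream(f, blobs):
    """Write blobs from an iterator of unknown length to seekable file f."""
    # Step 1: write the header with a placeholder object count of 0.
    f.write(struct.pack(">4sII", b"PACK", 2, 0))
    count = 0
    for data in blobs:
        f.write(encode_entry_header(OBJ_BLOB, len(data)))
        f.write(zlib.compress(data))
        count += 1

    # Step 2: seek back and fix up the count now that it is known.
    f.seek(8)
    f.write(struct.pack(">I", count))

    # Step 3: re-read the whole file to compute the final SHA-1.
    # This second pass over the data is the cost being discussed.
    f.seek(0)
    sha = hashlib.sha1()
    for block in iter(lambda: f.read(65536), b""):
        sha.update(block)

    # Step 4: append the checksum at the end of the file.
    f.seek(0, io.SEEK_END)
    f.write(sha.digest())
    return count, sha.digest()
```

The checksum cannot be accumulated during step 1 because it covers
the header, and the header changes in step 2.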

> Yes, this is exactly what git-fast-import does.  Yes, it sort of
> sucks.  But it's not as bad as you think.

For smaller packs, the I/O is all going to be buffered anyway,
but if we're going to have >4GB pack files, it adds a lot of real
I/O and SHA-1 computation for no good reason. If we get a rare chance
to have a new pack format, why not fix this wart at the same time?

  -Geert