On Feb 27, 2007, at 11:11, Shawn O. Pearce wrote:
> Geert Bosch <bosch@xxxxxxxxxxx> wrote:
>> When I import a large code-base (such as a *.tar.gz), I don't know
>> beforehand how many objects I'm going to create. Ideally, I'd like
>> to stream them directly into a new pack without ever having to write
>> the expanded source to the filesystem.
>
> See git-fast-import. If you are coming from a tar, also see
> contrib/fast-import/import-tars.perl. :-)
Yes, I saw that, really nice. I had written something myself to
create pack files from a streaming data source. Basically, I'm
breaking arbitrary data streams (mostly backups) into chunks
along content-defined boundaries and then linking the chunks
together in a tree. This eliminates duplicate files, and even
identical chunks shared between larger files. I create a new
pack file whenever the current one grows past a predefined
size (currently 128 MB).
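
To give an idea of what I mean by content-defined boundaries, the
cut-point detection is something along these lines (a simplified
sketch, not my actual code; the rolling checksum, window size and
mask are arbitrary choices here):

/*
 * Simplified sketch of content-defined boundary detection.
 * A chunk ends wherever the low bits of a rolling checksum hit
 * zero, so identical byte runs produce identical chunks no matter
 * where they sit in the stream.
 */
#include <stddef.h>
#include <stdint.h>

#define WINDOW 48       /* bytes in the rolling window */
#define MASK   0x1fff   /* ~8 KB average chunk size */

size_t next_boundary(const unsigned char *buf, size_t len)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < len; i++) {
                sum += buf[i];
                if (i >= WINDOW)
                        sum -= buf[i - WINDOW];   /* slide the window */
                if (i >= WINDOW && (sum & MASK) == 0)
                        return i + 1;             /* cut after this byte */
        }
        return len;                               /* no cut point: flush the rest */
}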
I'm happy I can base this on git-fast-import now.
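Concretely, each chunk would become a blob with a mark, and a commit
would tie the marks together into a tree, so the stream fed to
git-fast-import would look roughly like this (ref name, paths, byte
counts and the address are made-up placeholders, and the raw chunk
bytes are elided):

blob
mark :1
data 8192
<8192 bytes: first chunk>

blob
mark :2
data 4096
<4096 bytes: second chunk>

commit refs/heads/backup
mark :3
committer Geert Bosch <bosch@example.com> 1172592660 -0500
data 15
backup of /home
M 100644 :1 chunks/000000
M 100644 :2 chunks/000001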
>> So for creating a large pack from a stream of data, you have to do
>> the following:
>>   1. write out a temporary pack file to disk without the correct count
>>   2. fix up the count
>>   3. read the entire temporary pack file to compute the final SHA-1
>>   4. fix up the SHA-1 at the end of the file
>>   5. construct and write out the index
>
> Yes, this is exactly what git-fast-import does. Yes, it sort of
> sucks. But it's not as bad as you think.
For smaller packs, the I/O is all going to be buffered anyway,
but if we're going to have >4 GB pack files, this adds a lot of real
I/O and SHA-1 computation for no good reason. If we get a rare chance
to introduce a new pack format, why not fix this wart at the same time?
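
To make the wart concrete, steps 2 through 4 in the quoted list boil
down to roughly the following (a simplified sketch, not the actual
fast-import code; OpenSSL SHA-1, no error handling):

/*
 * Roughly what the fix-up amounts to today (simplified).  The pack
 * header is "PACK", a version, and the object count at offset 8;
 * the pack ends with a SHA-1 of everything before it.  Patching the
 * count invalidates that checksum, so the whole file has to be read
 * back.  The FILE * is assumed to be open for update ("r+b").
 */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <openssl/sha.h>

void fixup_pack(FILE *pack, uint32_t nr_objects)
{
        unsigned char buf[65536], sha1[SHA_DIGEST_LENGTH];
        uint32_t count = htonl(nr_objects);
        SHA_CTX ctx;
        size_t n;

        /* step 2: patch the object count in the 12-byte header */
        fseek(pack, 8, SEEK_SET);
        fwrite(&count, 4, 1, pack);
        fflush(pack);

        /* step 3: re-read everything written so far for the SHA-1 */
        SHA1_Init(&ctx);
        fseek(pack, 0, SEEK_SET);
        while ((n = fread(buf, 1, sizeof(buf), pack)) > 0)
                SHA1_Update(&ctx, buf, n);
        SHA1_Final(sha1, &ctx);

        /* step 4: append the checksum as the pack trailer */
        fseek(pack, 0, SEEK_END);
        fwrite(sha1, SHA_DIGEST_LENGTH, 1, pack);
        fflush(pack);
}

If a new format put the object count (or anything else not known up
front) after the object data, the checksum could be maintained
incrementally while streaming and the re-read could go away entirely.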
-Geert