Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> The other problem here is the caller cannot access the written objects
> until the pack is closed.  That is one of the things that has made
> fast-import difficult for git-svn to use, because git-svn expects the
> object to be available immediately.  I assume that within a single git
> add or git update-index process we don't need to worry about this, so
> its probably a non-issue.

Yes, it is part of a possible issue to be addressed in the plan.

I envisioned that the "API" I talked about in the NEEDSWORK you quoted
would keep an open file descriptor to the "currently being built"
packfile wrapped in a "struct packed_git", with an in-core index_data
that is adjusted every time you add a straight-to-pack kind of object.
Upon a "finalize" call, it would determine the final pack name, write
the real pack .idx file out, and rename the "being built" packfile to
the final name to make it available to the outside world.

Within a single git process, that approach would give access to the set
of objects that are going straight to the pack.  When it needs to spawn
a git subprocess, however, it would need to finalize the pack to give
the subprocess access to the new objects, just like fast-import flushes
when it is asked to expose its marks.

After all, this topic is about handling large binary files that would
not fit in core at once (we do not support them at all right now).  It
may not be too bad to say we stuff one object per packfile and
immediately close the packfile (which is what the use of fast-import by
the POC patch does).  Once the packfile is closed, the object in it is
automatically available to the outside world, and it is just a matter
of making a reprepare_packed_git() call to make it available to
ourselves.
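The "finalize" step above is essentially the usual write-under-a-
temporary-name-then-rename pattern.  Here is a minimal shell sketch of
just that pattern; the file names are made up for illustration, and
using a checksum of the pack contents as the final name is a stand-in
(the real pack name is derived differently), not what the code would
actually compute:

```shell
#!/bin/sh
# Sketch: build the pack under a temporary name, and only rename() it
# into place once it is complete, so the outside world never sees a
# half-written packfile.

objdir=objects/pack
mkdir -p "$objdir"

# Build phase: the writer appends objects to a file under a temp name
# that readers will never look at.
tmp="$objdir/tmp_pack_$$"
printf 'PACK...object data...' >"$tmp"

# Finalize phase: derive the final name (here, hypothetically, from a
# checksum of the contents), write the .idx, then atomically expose
# the pack by renaming it.
sum=$(cksum <"$tmp" | awk '{print $1}')
: >"$objdir/pack-$sum.idx"          # stand-in for writing the real index
mv "$tmp" "$objdir/pack-$sum.pack"  # rename() makes it visible atomically

ls "$objdir"
```

Until the rename happens, other processes scanning objects/pack/ for
"pack-*.pack" simply do not see the file, which is the property the
in-process API would rely on.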
When there are many such objects, as they would exceed
core.bigFileThreshold, repacking them would just amount to copying the
already compressed data literally (I haven't re-checked the code,
though), and the cost shouldn't be more than proportional to the size
of the data.  Expecting any system to do better than that is asking
for the moon, and I am not willing to bend over backwards to cater to
such demands before running out of other better things to do ;-).

So I am tempted to keep the "spawn an external fast-import" code at
least for now, and give a higher priority to making the other side
(writing out the blob to a working tree) streamable.