Re: Working with git binary stream

Jeff King <peff@xxxxxxxx> · Mon, 9 Aug 2021 17:07:46 -0400

On Mon, Aug 09, 2021 at 07:12:13PM +0300, anatoly techtonik wrote:

> As an alternative, it appears that there is also a
> "git binary stream" log that is produced by
> 
> git cat-file --batch --batch-all-objects
> 
> Is there a way to reconstruct the repository given that stream?

Yes, though it is probably not the easiest way to do so. Just dumping
all of the object contents back into another repository will indeed give
you the same hashes, etc. But if you change one object, then its hash
will change, and all of the other objects pointing to it will need
to change, etc. And that dump is in apparently-random order with respect
to the actual graph structure and relationship between objects.
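
To make the cascade concrete, here is a rough Python sketch that computes
object ids by hand for a blob, a tree that points at it, and a commit that
points at the tree. It assumes a SHA-1 repository; the file name, the author
line, and the helper names (object_id, tree_body, commit_body, chain) are
made up for illustration and are not part of git.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 object id: hash of "<type> <size>" + NUL, followed by the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_oid) entries as a tree object stores them.

    Sorting by plain name is a simplification that is fine for flat trees
    of regular files; git has a special rule for subdirectory names.
    """
    out = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
    return out


def commit_body(tree_oid: str) -> bytes:
    """A minimal commit with a hypothetical author and a fixed timestamp."""
    return (
        f"tree {tree_oid}\n"
        "author A U Thor <author@example.com> 0 +0000\n"
        "committer A U Thor <author@example.com> 0 +0000\n"
        "\n"
        "example\n"
    ).encode()


def chain(blob: bytes):
    """Return (blob id, tree id, commit id) for a one-file history."""
    blob_oid = object_id("blob", blob)
    tree_oid = object_id("tree", tree_body([("100644", "file.txt", blob_oid)]))
    commit_oid = object_id("commit", commit_body(tree_oid))
    return blob_oid, tree_oid, commit_oid


before = chain(b"hello\n")
after = chain(b"hello!\n")
# Changing one byte of the blob changes every id above it in the graph.
for old, new in zip(before, after):
    print(old, "->", new)
```

Only the blob contents differ between the two runs, yet the tree and commit
ids change too, because each parent embeds the id of the object below it.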

You'd probably do better to build a tool around rev-list, and only use
cat-file to fetch the verbatim object contents. At some point your tool
would start to look a lot like fast-export/fast-import, and it may be
less work to teach them whatever features you need to avoid any
normalization (e.g., retaining signatures, encodings, etc).
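
As a starting point, such a tool might look roughly like the Python sketch
below: ask rev-list for the reachable objects, then hand those ids to
"cat-file --batch" to get the verbatim contents. The function names and the
repo argument are hypothetical, and the sketch assumes "git" is on the PATH.

```python
import subprocess


def parse_rev_list_line(line: bytes):
    """Split one line of "git rev-list --objects" into (oid, path).

    Commits are printed as a bare id; trees and blobs are followed by a
    space and the path they were reached through (possibly empty).
    """
    oid, _, path = line.rstrip(b"\n").partition(b" ")
    return oid.decode("ascii"), path


def reachable_objects(repo: str):
    """List (oid, path) pairs in the order rev-list walks them."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True,
    ).stdout
    return [parse_rev_list_line(line) for line in out.splitlines()]


def cat_objects(repo: str, oids) -> bytes:
    """Fetch the verbatim objects for the given ids as a --batch stream."""
    feed = "".join(oid + "\n" for oid in oids).encode("ascii")
    return subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch"],
        input=feed, check=True, capture_output=True,
    ).stdout
```

Reading everything into memory like this is only reasonable for small
repositories; a real tool would want to stream both commands.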

> Is there documentation on how to read it?

The output format is described in the "BATCH OUTPUT" section of "git
help cat-file". Each entry is a header line with the object id, type,
and size in bytes ("<oid> <type> <size>"), then the object contents,
then a trailing newline. Use the size from the header to know how many
bytes to read.
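
A reader for that format can be fairly small. The Python sketch below is
one way to do it, assuming the default output format (no custom format
string) and a stream where every object exists, as with
--batch-all-objects. The name read_batch is made up for illustration.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_batch(stream: BinaryIO) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (oid, type, contents) for each entry in a cat-file --batch stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # clean end of stream
        oid, obj_type, size_text = header.decode("ascii").split()
        size = int(size_text)
        # Contents are raw bytes and may contain newlines, so read by size.
        data = stream.read(size)
        if len(data) != size:
            raise ValueError(f"truncated object {oid}")
        # Each entry ends with one newline that is not part of the object.
        if stream.read(1) != b"\n":
            raise ValueError(f"missing trailing newline after {oid}")
        yield oid, obj_type, data


# Tiny hand-made stream: an empty blob followed by a 6-byte blob.
sample = io.BytesIO(
    b"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0\n\n"
    b"aaaa blob 6\nhi\nyo\n\n"
)
for oid, obj_type, data in read_batch(sample):
    print(oid, obj_type, len(data))
```

The second id in the sample is a placeholder, not a real hash; the parser
does not verify ids, it only splits the stream.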

There's no tool to accept the whole stream. You'd have to parse each
entry and feed it to "git hash-object" with the appropriate type.
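
Feeding one parsed entry to hash-object could look something like this
Python sketch. It runs "git hash-object -w -t <type> --stdin" once per
object and checks that the id git reports matches the one from the dump.
The helper names and the repo argument are hypothetical, and "git" is
assumed to be on the PATH.

```python
import subprocess


def hash_object_argv(repo: str, obj_type: str):
    """Command line that writes one object of the given type from stdin."""
    return ["git", "-C", repo, "hash-object", "-w", "-t", obj_type, "--stdin"]


def write_object(repo: str, obj_type: str, data: bytes, want_oid: str) -> str:
    """Write one object into repo and verify it got the expected id."""
    result = subprocess.run(
        hash_object_argv(repo, obj_type),
        input=data, check=True, capture_output=True,
    )
    got_oid = result.stdout.decode("ascii").strip()
    if got_oid != want_oid:
        # A mismatch means the contents or type differ from the original
        # object (or the two repositories use different hash algorithms).
        raise ValueError(f"expected {want_oid}, git wrote {got_oid}")
    return got_oid
```

This spawns one process per object, so it will be slow on anything but a
small dump.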

A mode for hash-object that reads in a bunch of objects in "cat-file
--batch" format wouldn't be unreasonable, but nobody has found a need
for it so far. It would also be quite slow (it writes out individual
loose objects, whereas something like fast-import writes out a packfile,
including at least a basic attempt at deltas).

-Peff