Re: Making bit-by-bit reproducible Git Bundles?

On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:

> >   2. There is no way to pass pack-objects options down through
> >      git-bundle. So you'd have to either assemble the bundle yourself,
> >      or perhaps generate a stable on-disk pack state, and then generate
> >      the bundle. Perhaps something like:
> >
> >        # make one single pack, with no reuse, using the default options
> >        git -c pack.threads=1 repack -adf
> 
> Yay!  You may have solved this for me.  I have to verify this a bit
> more, but this looks promising (these are two different git clones):
> 
> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle 
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle 
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-2$ 

One thing to watch out for here: that repack is going to look at _all_
objects in the repository. So you will get different output if you make
a bundle of a tag "v1.0" today than you would get later, when "v1.1"
also exists. Ditto for any other activity in the repository, like writes
to unrelated branches, or even reflog entries.

So you'd probably want to make an absolute minimal repository with the
reachable objects, perhaps like:

  git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
  cd just-v1.0.git
  git -c pack.threads=1 repack -adf

It doesn't have to be just one ref, of course; you might want to
snapshot the whole set of refs at the time you make the bundle. E.g., by
fetching into the empty repo using a refspec.
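A rough sketch of that refspec approach follows. The names "src" and "snapshot.git" are placeholders, and the throwaway source repository exists only so the snippet runs on its own; in practice you would fetch from your real repository instead.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# throwaway source repo (hypothetical; stands in for your real one)
git init -q src
echo hello >src/file
git -C src add file
git -C src -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'
git -C src tag v1.0

# start from an empty bare repo, so there are no unrelated objects or
# reflog entries for the repack to pick up
git init -q --bare snapshot.git

# copy every ref as it stands right now; the refspec is quoted so the
# shell does not try to expand the "*"
git -C snapshot.git fetch -q ../src 'refs/*:refs/*'

# then the same stable repack as before, followed by the bundle
git -C snapshot.git -c pack.threads=1 repack -q -adf
git -C snapshot.git bundle create ../snapshot.bundle --all
```

You can narrow the refspec (say, to 'refs/tags/*:refs/tags/*') if you only want a subset of the refs in the snapshot.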

This would all be a non-issue if you could ask git-bundle to directly
pass the equivalent of "-f" to pack-objects (at that layer it is called
"--no-reuse-delta"), since it would then be computing the full set of
objects itself. But without a patch to Git, I don't think there's a way
to do that.

The bundle format is pretty simple, so you _could_ hack around it
yourself, like:

  # list refs we care about; you can pick whatever subset you want
  # here.
  git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs

  {
	# bundle header plus list of refs, plus blank line terminator
	echo "# v2 git bundle"
	cat refs
	echo

	# and now the pack. We just need to feed it the object ids for
	# all of the refs. It will handle sorting and de-duping for us.
	cut -d' ' -f1 <refs |
	git -c pack.threads=1 pack-objects \
		--stdout --revs --delta-base-offset --no-reuse-delta
  } >foo.bundle

I dunno if that is more or less gross than teaching git-bundle to pass
--no-reuse-delta itself. It's certainly more intimate with the details,
but OTOH it is less likely to change in other versions of Git (e.g., if
we started making "v3" bundles by default).
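If you go the hand-assembled route, it is probably worth sanity-checking the result. Here is a minimal sketch that wraps the recipe above in a function (make_bundle is a made-up name), builds the bundle twice in a throwaway repository, and checks that Git accepts it and that both copies are byte-identical. Note that "git bundle verify" only looks at the header and prerequisites, so it is a light check, not proof that the pack is sound.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# throwaway repo (hypothetical; stands in for your real one)
git init -q demo
cd demo
echo hello >file
git add file
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'

# same steps as the recipe above, writing the bundle to "$1"
make_bundle() {
	git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >../refs
	{
		echo "# v2 git bundle"
		cat ../refs
		echo
		cut -d' ' -f1 <../refs |
		git -c pack.threads=1 pack-objects -q \
			--stdout --revs --delta-base-offset --no-reuse-delta
	} >"$1"
}

make_bundle ../a.bundle
make_bundle ../b.bundle

# does Git recognize the header and the refs we listed?
git bundle verify ../a.bundle
git bundle list-heads ../a.bundle

# two runs over the same objects should produce the same bytes
cmp ../a.bundle ../b.bundle && echo "bundles are byte-identical"
```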

> >   # print all commits in topological order, with ties broken by
> >   # committer date, which should be stable. And then follow up with the
> >   # trees and blobs for each.
> >   git rev-list --topo-order --objects HEAD >objects
> >
> >   # now print the contents of each object (preceded by its name, type,
> >   # and length, so there's no chance of weird prepending or appending
> >   # attacks). We cut off the path information from rev-list here, since
> >   # the ordered set of objects is all we care about.
> >   cut -d' ' -f1 objects |
> >   git cat-file --batch >content
> >
> >   # and then take a hash over that content; this will be unambiguous.
> >   sha256sum <content
> 
> How to read this output?  Could this be made git bundle compatible?

You'd have to compare the result of doing that after fetching from the
bundle into an empty repo. I don't think there's a great way to operate
directly on the bundle packfile (it has to be indexed first to see
what's in it).
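As a sketch of that comparison: compute the content hash in the original repository, fetch the bundle into an empty bare repository, and compute it again there. The names content_hash, "src" and "fromb.git" are made up for illustration, and the throwaway source repository exists only so the snippet runs on its own. It assumes sha256sum is available, as in the commands quoted above.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# throwaway source repo (hypothetical; stands in for your real one)
git init -q src
echo hello >src/file
git -C src add file
git -C src -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'

# hash the logical contents of everything reachable from tip "$2" in
# repository "$1", using the rev-list + cat-file pipeline quoted above
content_hash() {
	git -C "$1" rev-list --topo-order --objects "$2" |
	cut -d' ' -f1 |
	git -C "$1" cat-file --batch |
	sha256sum
}

# bundle the current branch, then unpack it into an empty repo
branch=$(git -C src symbolic-ref HEAD)
git -C src bundle create ../src.bundle "$branch"
git init -q --bare fromb.git
git -C fromb.git fetch -q ../src.bundle 'refs/*:refs/*'

# the two hashes should match if the bundle carries the same objects
content_hash src "$branch"
content_hash fromb.git "$branch"
```

Because this hashes the uncompressed object contents in a fixed order, it should not care how the pack inside the bundle happened to be delta-compressed.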

The closest I could get is:

  input=foo.bundle

  # split the bundle into header and packfile sections on the first
  # blank line
  sed '/^$/q' <"$input" >header
  size=$(stat --format=%s header)
  tail -c +$((size+1)) <"$input" >bundle.pack

  # we can first do a byte-level comparison of the header; if this isn't
  # the same, the bundles do not match.
  sha256sum <header

  # now index the pack, so we know what's in it; this makes bundle.idx
  git index-pack -v bundle.pack

  # and now we want to dump the full logical contents (not the
  # delta-compressed versions) of each object. First we need a list of
  # the objects. This will come out in lexical order of object id, which
  # is good for us since it will be stable.
  git show-index <bundle.idx | awk '{print $2}' >objects

  # unfortunately here things break down. There is no command to read
  # the data directly out of the pack/idx pair without a repository
  # (even though it could be done technically). So we hack around it
  # with a temp repo.
  git init --bare tmp.git
  mv bundle.idx bundle.pack tmp.git/objects/pack/
  git -C tmp.git cat-file --batch <objects | sha256sum

So...also kind of gross. And not really all that different than what:

  git init --bare tmp.git
  cd tmp.git
  git fetch ../foo.bundle 'refs/*:refs/*'

would do (you end up with the same pack/idx pair). So I dunno. I guess
it depends how many and which Git commands you're willing to trust. ;)

-Peff



