Re: Making bit-by-bit reproducible Git Bundles?

On Wed, Mar 12, 2025 at 12:40:05PM +0100, Simon Josefsson wrote:

> If I run the recipe above twice (including the clone), I get different
> checksums.  This even if nothing was committed in the remote repository
> meanwhile.
> 
> Is it possible to create a bit-by-bit reproducible git bundle using some
> other set of commands?  If so, how?  I'm using git 2.48.1 from Guix.

As Junio noted, multithreading is the first problem. E.g., here are some
commands on git.git, using my 8-core machine:

  [try once...]
  $ git bundle create --no-progress - HEAD | sha1sum
  686da850200da487032c9d91bdc544b605a3e426  -

  [and again; oops, it's different]
  $ git bundle create --no-progress - HEAD | sha1sum
  70b018c16d244f32b36e55deb931e29ae15506e3  -

  [now without threading]
  $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
  c897caf9c68d2c37d997d3973196886af3b0b46e  -

  [and we can do it again. yay!]
  $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
  c897caf9c68d2c37d997d3973196886af3b0b46e  -

What's happening here is that the bundle mostly consists of a packfile,
where many objects will be stored as deltas against others. The search
for deltas is multi-threaded, so it will find slightly different ones
each time (there surely is an "optimal" answer, but finding it is much
too expensive, so we bound the search with some heuristics).

So disabling threading gives you a deterministic answer. But that's not
the end of the story! We only search for deltas of objects that are not
already stored as deltas in on-disk packfiles; any delta we already have
on disk is reused as-is (assuming that both the delta and its base are
going to be in the output). So the result also depends on how the
repository you're bundling from happens to be packed.

There are options to ask pack-objects (the command which git-bundle uses
under the hood to generate the pack) not to reuse deltas, such as
--no-reuse-delta. So pack-objects running on a single thread without any
delta reuse should generate a deterministic pack. But there are some
gotchas:

  1. It's stable only for a given Git version, and with a particular set
     of delta window/depth options. I wouldn't expect behavior to change
     much between versions, but it's not something that we try to
     guarantee.

  2. There is no way to pass pack-objects options down through
     git-bundle. So you'd have to either assemble the bundle yourself,
     or perhaps generate a stable on-disk pack state, and then generate
     the bundle. Perhaps something like:

       # make one single pack, with no reuse, using the default options
       git -c pack.threads=1 repack -adf

       # now we can make a bundle from that. We probably do not even
       # need to disable threads here, since we'd just be picking the
       # deltas from the on-disk file (assuming that you're including
       # all objects in the bundle)
       git bundle create - --all | sha1sum

  3. It will be really slow. We're throwing out all of the deltas and
     searching from scratch. And doing it single-threaded. I didn't time
     it, but I'd guess from past experience we're talking about hours to
     generate the bundle for something like linux.git.
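
For the "assemble the bundle yourself" route in item 2, here is a rough
sketch of what that might look like. It is not what git-bundle does
internally, just one way to write the same file format by hand so that
pack-objects options can be given directly. It assumes a SHA-1
repository (hence the v2 header), bundles only HEAD with its full
history, and make_stable_bundle is a made-up name:

```shell
# Sketch only: write a v2 bundle for HEAD by hand.
# Assumes a SHA-1 repository; make_stable_bundle is a hypothetical name.
make_stable_bundle () {
    oid=$(git rev-parse --verify 'HEAD^{commit}') || return 1

    # bundle header: signature line, one "<oid> <refname>" line per
    # ref, then a blank line before the pack data
    printf '# v2 git bundle\n%s HEAD\n\n' "$oid"

    # the pack itself: one thread, and no reuse of on-disk deltas or
    # already-compressed objects
    echo "$oid" |
    git pack-objects --stdout -q --revs --threads=1 \
        --no-reuse-delta --no-reuse-object
}
```

Running "make_stable_bundle | sha1sum" twice in the same repository
should then print the same checksum, with the caveats from item 1 still
applying (same Git version, same window/depth settings).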

So I think it's possible, but I doubt it's very ergonomic. You're
probably better off using some checksum over Git's logical model, rather
than the stored bytes. The obvious one is that a single Git commit hash
unambiguously represents the whole tree and all of history leading up to
it, because of the chains of hashes.
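
As a small illustration of that idea, two bundles can be compared by the
ref tips they advertise instead of by their bytes. This is only a
sketch, and same_bundle_tips is a made-up name. Note that the header is
just a claim until you actually unbundle (e.g., by fetching from the
bundle), which is when the objects themselves get checked:

```shell
# Sketch: succeed if two bundle files advertise the same refs at the
# same commit hashes, regardless of how their packs were built.
# same_bundle_tips is a hypothetical helper name.
same_bundle_tips () {
    a=$(git bundle list-heads "$1") &&
    b=$(git bundle list-heads "$2") &&
    test "$a" = "$b"
}
```

For example, "same_bundle_tips one.bundle two.bundle && echo same"
should print "same" for two bundles of an unchanged repository even
when their sha1sums differ.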

But that implies you trust Git's object hash algorithm. If you don't
trust sha1 (and don't want to try out the sha256 support), then you'd
have to design something else.  Perhaps something like:

  # print all commits in topological order, with ties broken by
  # committer date, which should be stable. And then follow up with the
  # trees and blobs for each.
  git rev-list --topo-order --objects HEAD >objects

  # now print the contents of each object (preceded by its name, type,
  # and length, so there's no chance of weird prepending or appending
  # attacks). We cut off the path information from rev-list here, since
  # the ordered set of objects is all we care about.
  cut -d' ' -f1 objects |
  git cat-file --batch >content

  # and then take a hash over that content; this will be unambiguous.
  sha256sum <content
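
The same recipe can be wrapped up as one pipeline without temporary
files, which makes it easy to run against two different clones. Again
just a sketch: logical_checksum is a made-up name, and it assumes
sha256sum from coreutils is available:

```shell
# Sketch: hash the logical content reachable from a revision.
# usage: logical_checksum <repo-dir> [<rev>]   (rev defaults to HEAD)
# logical_checksum is a hypothetical helper name.
logical_checksum () {
    git -C "$1" rev-list --topo-order --objects "${2:-HEAD}" |
    # drop the path names; only the ordered object ids matter
    cut -d' ' -f1 |
    # "<oid> <type> <size>" header followed by the raw contents
    git -C "$1" cat-file --batch |
    sha256sum |
    cut -d' ' -f1
}
```

Two clones of the same history should give the same value here no
matter how each one is packed on disk, since only object contents and
their order go into the hash.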

-Peff



