On 1/31/23 4:54 AM, brian m. carlson wrote:
> Part of the reason I think this is valuable is that once SHA-1 and
> SHA-256 interoperability is present, git archive will change the
> contents of the archive format, since it will embed a SHA-256 hash into
> the file instead of a SHA-1 hash, since that's what's in the repository.
> Thus, we can't produce an archive that's deterministic in the face of
> SHA-1/SHA-256 interoperability concerns, and we need to create a new
> format that doesn't contain that data embedded in it.

I assume that whatever the reason was for originally embedding the OID
into the file, it still applies even if a new PAX format is established
for the use of git-archive. It may not be a great reason -- I don't
know. Perhaps there's an argument for removing it. But can't that be
done irrespective of standardizing the PAX format?

...

I'm not deeply knowledgeable about the SHA-256 transition work -- or
knowledgeable at all about it, frankly. (Also, my understanding was that
it had stalled, as discussed in https://lwn.net/Articles/898522/ -- I
understand that you're still enthusiastic about the work? But that
doesn't really answer "is there a timeframe for that to ever happen".)

But I sort of assumed that the transition work would already have to
embed a fair bit of information into the repository about the whole
process? Would it not be possible to determine whether a given tag
started life as SHA-1 or SHA-256? Maybe even just record a date when the
repository was converted to work with both, and embed the OID based on
whether the tag is tagging contents that were created after that
conversion? It seems to me like the problem should be solvable if people
want to solve it.

...

git-archive run on a commit obviously doesn't have this problem -- it
can simply embed the OID for the same argument it was called with. But I
assume it's far more common to access tag-based github endpoints. :D

> Having said that, I don't think this should be based on the timestamp of
> the file, since that means that two otherwise identical archives
> differing in timestamp aren't ever going to be the same, and we do see
> people who import or vendor other projects.

The timestamp of the output file? Surely not. But I only suggested the
timestamp of the commit/tag metadata that git-archive is asked to
produce output for. And we would need that in order to solve the problem
that reproducible github API archive endpoints pose.

I'm not sure what the "import or vendor other projects" angle here
means. Do you mean people who copy a directory of files into their
project? Who expects this to be the same to begin with? And doesn't
embedding the OID kill this idea, since the entire point of git commit
sha's is that you shouldn't be able (it should be prohibitively
unrealistic) to produce the same one twice in different contexts?

I have never said to myself "ah yes, I really would like to be able to
download a git auto-generated tarball for project A, and compare its
hash to the tarball for project B, and have them compare identical even
though they are different projects with different commits". IMHO this
isn't an interesting problem to solve -- the interesting problem to
solve is that a single absolute URL to a downloadable file should be
able to offer documented guarantees that it will always be the same
file, even though it is generated on the fly.
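(For concreteness, the "embedded OID" here is the pax global extended
header comment that git archive writes into the tar stream today; as far
as I can tell it is the only repository-hash-dependent data in the
format, and git ships a helper to read it back out. A minimal check, in
any checkout:

    git archive --format=tar HEAD | git get-tar-commit-id

That comment is exactly the value that would flip from SHA-1 to SHA-256
under the interoperability work.)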
> Nor do I think we should
> attempt to provide consistent compression, since I believe the output of
> things like zlib has changed in the past, and we can't continually carry
> an old, potentially insecure version of zlib just because the output
> changed. People should be able to implement compression using gzip,
> zlib, pigz, miniz_oxide, or whatever if they want, since people
> implement Git in many different languages, and we won't want to force
> people using memory-safe languages like Go and Rust to explicitly use
> zlib for archives.

I do not think it is realistic or reasonable for people to implement
compression using intentionally incompatible replacements for gzip and
expect interoperability of any sort. I also don't think people *have* to
implement compression in rust using zlib, but if they are going to make
a git-alike that produces archives, it would be worth it for them to
write whatever memory-safe rust is necessary to memory-safely produce
the same output stream of bytes. It's no less feasible than making sure
that busybox gzip and GNU gzip produce the same output, surely.

Alternatively, they could just not bother with gzip at all, and make
their git-alike produce zstd-compressed tarballs, which change their
byte outputs every time a new zstd release is published. :D Again, why
limit yourself to gzip if you want to be innovative anyway?

> That may mean that it's important for people to actually decompress the
> archive before checking hashes if they want deterministic behaviour, and
> I'm okay with that. You already have to do that if you're verifying the
> signature on Git tarballs, since only the uncompressed tar archive is
> signed, so I don't think this is out of the question.

This is a very kernel.org-centric view of things, I think. I have rarely
seen PGP signatures applied to the uncompressed tar except in that
context. The vast majority of tarballs with signatures sign a single
compressed tarball and don't concern themselves with, say, providing a
rotating, backdated, changeable list of compression formats with a
single signature covering all of them.

Nevertheless, in order to handle kernel.org-style tarballs, you are
entirely correct that one should be able to handle this.

From experience, I can say that this needs to be selected on a
per-tarball basis. Since signature files have filenames, we can match
their stems: given foo.tar.asc and foo.tar.gz, check the signature of
the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and
foo.tar.gz, simply check the signature of the original foo.tar.gz (see
the rough sketch at the end of this mail).

This doesn't really work for checksums, because you need to settle on
one or the other everywhere, or else embed decompression information
into your checksum metadata field. And for tarballs that are generated
once and uploaded to ftp storage, not repeatedly generated on the fly,
we know the checksum will never legitimately change, so we *want* to
hash the compressed file.

Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
At least one can verify the checksums first without decompression, which
is virtually guaranteed to catch invalid source code releases, so if you
ever progress to the PGP verification stage it's unlikely to be wasted
effort -- that tarball is definitely getting used to build something.
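To make the stem-matching idea above concrete, here is a rough sketch of
the selection logic (the function name and layout are just illustration,
not a proposal for any particular tool):

    verify_tarball() {
        # a signature named for the *uncompressed* tar covers the
        # decompressed stream; one named for the .tar.gz covers the
        # compressed file as-is
        sig=$1 tarball=$2
        case $sig in
            *.tar.asc)    gzip -dc < "$tarball" | gpg --verify "$sig" - ;;
            *.tar.gz.asc) gpg --verify "$sig" "$tarball" ;;
        esac
    }

    verify_tarball foo.tar.asc foo.tar.gz      # kernel.org style
    verify_tarball foo.tar.gz.asc foo.tar.gz   # the more common style

--
Eli Schwartz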