Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution

On 2023-01-31 at 00:06:44, Eli Schwartz wrote:
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

I'm one of the GitHub employees who chimed in there, and I'm also a Git
contributor in my own time (and I am speaking here only in my personal
capacity, since this is a personal address).  I made a change some years
back to the archive format to fix the permissions on pax headers when
extracted as files; kernel.org was relying on the previous output and
broke.  Linus yelled at me because of that.

Since then, I've been very opposed to us implicitly guaranteeing a
consistent output format when we have never explicitly promised one.  I
sent some patches a while back that documented this explicitly, but I
don't think they were ever picked up.  I very much don't want people to
come to rely on our behaviour unless we explicitly guarantee it.

> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived. So:
> 
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame

I am actually very much in favour of providing a standard, deterministic
version of pax (the extended tar format) that we use and documenting it
as a standard so that other archive tools can use that.  That is, we
document some canonical tar format that is bit-for-bit identical that we
(and hopefully GNU tar and libarchive) will agree should be used to
serialize files for software interchange.  I don't think this should be
dependent on the date at all, but I do believe it should be versioned
and tested, and the version number embedded as a pax header.  I think
this would be valuable for simply having reproducible archives in
general, including for things like Docker containers, Debian packages,
Rust crates, and more, and I'm happy to work with others on such a
format, as I've said in the past on the list.  People can opt in to
whatever format version they want when creating an archive and continue
to use that forever if they like.
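
As a rough illustration of what "deterministic and versioned" could mean
in practice, here is a minimal Python sketch using the standard tarfile
module.  It is not Git's archive writer and not a proposed standard: the
pax key name EXAMPLE.canonical-tar-version and the helper canonical_tar
are hypothetical, chosen only for this example.  The idea is to pin
every piece of metadata that could vary (entry order, mtime, mode,
owner) and to record a format version in a pax global header.

```python
import io
import tarfile

# Hypothetical pax key; no such header is standardized anywhere.
FORMAT_VERSION_KEY = "EXAMPLE.canonical-tar-version"


def canonical_tar(files: dict[str, bytes], version: str = "1") -> bytes:
    """Serialize files into a pax archive with all variable metadata pinned."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        # Written once as a pax global extended header.
        pax_headers={FORMAT_VERSION_KEY: version},
    ) as tf:
        # Sort names so the caller's insertion order cannot change the bytes.
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0          # fixed timestamp (an assumption of this sketch)
            info.mode = 0o644       # fixed permissions
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    a = canonical_tar({"b.txt": b"two", "a.txt": b"one"})
    b = canonical_tar({"a.txt": b"one", "b.txt": b"two"})
    print(a == b)
```

Under these assumptions, two calls with the same contents give the same
bytes regardless of insertion order, and a reader can check the version
header before deciding how to interpret the archive.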

Part of the reason I think this is valuable is that once SHA-1 and
SHA-256 interoperability is present, git archive will change the
contents of the archive: it will embed a SHA-256 hash into the file
instead of a SHA-1 hash, because that's what's in the repository.
Thus, we can't produce an archive that's deterministic in the face of
SHA-1/SHA-256 interoperability concerns, and we need to create a new
format that doesn't contain that data embedded in it.
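
To make the problem concrete: git archive records the commit ID in a pax
global extended header as a "comment" field (this is what
git get-tar-commit-id reads back).  The Python sketch below imitates
that with the standard tarfile module; it does not reproduce Git's exact
byte layout, and the two IDs are made-up placeholders, not real object
names.  It only shows that identical file contents yield different
archive bytes once the embedded ID changes length and value.

```python
import hashlib
import io
import tarfile


def archive_with_commit_id(commit_id: str) -> bytes:
    """Build a one-file pax archive carrying commit_id as a global comment."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"comment": commit_id},
    ) as tf:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Placeholder IDs: 40 hex digits (SHA-1 sized) and 64 hex digits (SHA-256 sized).
sha1_style = archive_with_commit_id("a" * 40)
sha256_style = archive_with_commit_id("b" * 64)

if __name__ == "__main__":
    print(hashlib.sha256(sha1_style).hexdigest())
    print(hashlib.sha256(sha256_style).hexdigest())
```

The file payload is the same in both archives; only the header differs,
which is enough to change any checksum taken over the whole tarball.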

Having said that, I don't think this should be based on the timestamp of
the file, since that means that two otherwise identical archives
differing in timestamp aren't ever going to be the same, and we do see
people who import or vendor other projects.  Nor do I think we should
attempt to provide consistent compression, since I believe the output of
things like zlib has changed in the past, and we can't continually carry
an old, potentially insecure version of zlib just because the output
changed.  People should be able to implement compression using gzip,
zlib, pigz, miniz_oxide, or whatever if they want, since people
implement Git in many different languages, and we won't want to force
people using memory-safe languages like Go and Rust to explicitly use
zlib for archives.

That may mean that it's important for people to actually decompress the
archive before checking hashes if they want deterministic behaviour, and
I'm okay with that.  You already have to do that if you're verifying the
signature on Git tarballs, since only the uncompressed tar archive is
signed, so I don't think this is out of the question.
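
A minimal Python sketch of that verification order, assuming a
gzip-compressed tarball: decompress first, then hash the tar bytes.  The
helper name tar_digest and the sample payload are illustrative only.

```python
import gzip
import hashlib


def tar_digest(compressed: bytes) -> str:
    """SHA-256 of the decompressed payload, independent of the compressor."""
    return hashlib.sha256(gzip.decompress(compressed)).hexdigest()


# Stand-in for the bytes of a tar archive.
payload = b"stand-in for tar bytes\n" * 1000

# Same payload, two different compression settings.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

if __name__ == "__main__":
    print(tar_digest(fast) == tar_digest(best))
```

The compressed streams may differ from one compressor or level to
another, but the digest of the decompressed data stays the same, so a
checksum taken this way survives a change of compression
implementation.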
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
