Re: [RFC PATCH 1/1] Document a fixed tar format for interoperability

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 08 Feb 2023 12:07:44 +0100

On Tue, Feb 07 2023, brian m. carlson wrote:

> On 2023-02-06 at 22:18:47, Ævar Arnfjörð Bjarmason wrote:
>> Maybe there are other changes in the proposed spec that put it at odds
>> with such a goal, it's unclear to me if this is the only difference.
>
> As mentioned in the description, that doesn't address trees, which have
> never been consistent traditionally.

You mention "[...]it produces identical results for identical trees,
regardless of hash algorithm". I'm not familiar with how we encode trees
differently based on the hash algorithm. Do we stick the tree OID in
there somewhere, or is it something else?

IOW do these trees vary within the same hash algorithm, or is it another
special-case where we now produce a different tarball with SHA-1 and
SHA-256 with commits, but also with trees?

B.t.w. are there some options to tar(1) to make it dump these headers
you're describing? I couldn't find anything when looking, it looks like
libtar might support it, but I was hoping for something more compatible
with my laziness :)
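
A minimal sketch of one way to dump these headers without libtar, assuming
plain octal ustar fields (no base-256 sizes). It walks the raw 512-byte
blocks and lists each header's type flag, name, mode, mtime and size. The
names dump_headers() and demo_archive() are made up for illustration, and
the demo archive is written by Python's tarfile module, not by "git
archive", so its global header mode is not git's.

```python
import io
import tarfile

BLOCK = 512  # tar headers and payload padding are both 512-byte blocks


def octal(field: bytes) -> int:
    """Parse a NUL/space-terminated octal header field (empty means 0)."""
    text = field.split(b"\0", 1)[0].strip()
    return int(text, 8) if text else 0


def dump_headers(data: bytes):
    """Return (typeflag, name, mode, mtime, size) for each header block."""
    rows = []
    pos = 0
    while pos + BLOCK <= len(data):
        hdr = data[pos:pos + BLOCK]
        if hdr == b"\0" * BLOCK:  # end-of-archive marker
            break
        name = hdr[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        mode = octal(hdr[100:108])
        size = octal(hdr[124:136])
        mtime = octal(hdr[136:148])
        typeflag = chr(hdr[156]) if hdr[156] else "0"
        rows.append((typeflag, name, mode, mtime, size))
        # Skip the payload, which is padded to a whole number of blocks.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return rows


def demo_archive() -> bytes:
    """Hypothetical stand-in archive: a pax global header plus one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": "0" * 40}) as tar:
        payload = b"hello\n"
        info = tarfile.TarInfo("hello.txt")
        info.size = len(payload)
        info.mode = 0o644
        info.mtime = 0
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


for typeflag, name, mode, mtime, size in dump_headers(demo_archive()):
    print(typeflag, name, oct(mode), mtime, size)
```

A type flag of "g" marks the pax global header, which is where "git
archive" records the commit ID as a "comment" record; "git
get-tar-commit-id" reads that same record back.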

> We also have bad permissions for pax headers (always 666), which is
> something we've tried to fix before and is not something we want to
> carry on with.

I'm concerned that you're expanding the scope of a "stable" tar format
to necessarily include one-off fixing various things we've regretted
over the years.

Maybe that *needs* to happen, but so far I don't see why. You've
described:

* We include the OID in the metadata
* Something like that, but for trees?
* The sucky 0666 permissions we'd like to fix.
* We don't serialize timestamps (which is now optional, depending on how
  you invoke it)

I really applaud your efforts here, but if that's the extent of the
changes, I don't see why the v1 and default format shouldn't be something
that produces identical results to "git archive" as it stands today.

Then a v1/v2 is just this pseudocode, isn't it?

	switch (version) {
	case 1:
		break; /* warts and all */
	case 2:
		include_oid = 0;         /* no OID in the metadata */
		satanic_permissions = 0; /* no 0666 pax header mode */
		no_timestamps = 1;       /* everything at the Epoch */
		break;
	}

The reason we haven't promised to support an archive format isn't
because we didn't find every aspect of it aesthetically pleasing, but
because we didn't want to commit to some bug-for-bug compatibility with
whatever the code is doing right now.

Now that you've done the work to specify it, it turns out that a
proposed format you'd like going forward is almost identical to what we
currently emit, to the point that supporting that as a v1 seems rather
trivial (but again, I may still be missing something).

We have a huge long-tail of users in the wild. Forcing those users to go
through a one-time breakage of their existing archives, when we could
avoid that by making v1 the current format, seems entirely unnecessary.

I totally see your point about wanting byte-for-byte the same archives
out of the SHA-1 and SHA-256 version of the same repo, I think that's a
good goal, and it's also a good goal to get rid of these other warts.

But I don't see why it needs to be required, or even the default.

> You specifically sent a patch stating that we're not guaranteeing that
> format, and I agree with that assessment.  I'm proposing a format that
> we would guarantee and which does not have any of the historical baggage
> or warts that we don't want to keep.

Per the above I just don't see why that's a criterion. I think we should
be weighing the benefits of changing the existing default "git archive"
output vs. the cost of maintaining the delta to some v2 wart-less
format.

> This format also doesn't serialize timestamps; everything is at the
> Epoch.  Again, that's because serializing a commit and its tree or even
> a tag and its commit would produce different results.

This seems like further scope creep, and in this particular case I don't
see how *always* doing that helps you with reproducibility.

I.e. for the cases where we're now given a top-level tree it's obvious
how this helps: we encode the time(), so the output is different every
time.

But in the case where we get a commit (or tag) ID we use the timestamp
in the commit (or tag?) envelope.

When producing a release archive, or packing up a given commit that's
therefore going to be stable, even between SHA-1 and SHA-256, although
those two would differ if the OID is put in the header, but that's
another matter.

If I understand you correctly here you seem to be in pursuit of another
goal entirely, which is that you'd like the same output for different
commits if they're TREESAME.
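
A rough sketch of that TREESAME property, using Python's tarfile module
rather than git's own writer, so the bytes are not what "git archive"
emits. make_archive() and its version numbers are hypothetical: version 1
stands in for today's behaviour (commit ID in the global header comment,
commit time as mtime), version 2 for the proposed one (no commit ID,
everything at the Epoch). The 0666 pax header mode is not modelled.

```python
import io
import tarfile


def make_archive(files, version, oid, commit_time):
    """Build a tar from `files` (name -> bytes) under a hypothetical version."""
    if version == 1:
        pax, mtime = {"comment": oid}, commit_time
    elif version == 2:
        pax, mtime = None, 0
    else:
        raise ValueError(f"unknown version {version}")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers=pax) as tar:
        for name in sorted(files):  # fixed order keeps the output stable
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mode = 0o644
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


# Two different commits (made-up IDs and times) that share one tree.
tree = {"README": b"hi\n", "main.c": b"int main(void) { return 0; }\n"}
a1 = make_archive(tree, 1, "a" * 40, 1675800000)
b1 = make_archive(tree, 1, "b" * 40, 1675900000)
a2 = make_archive(tree, 2, "a" * 40, 1675800000)
b2 = make_archive(tree, 2, "b" * 40, 1675900000)

print(a1 == b1)  # False: the commit ID and mtime differ
print(a2 == b2)  # True: nothing commit-specific is serialized
```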

Or, if you have a bunch of release archives a very nice attribute of
this is that with a bunch of similar archives on the same FS you could
e.g. benefit more from block-level deduplication.

All of which is cool, but I don't see why it needs to be a hard
requirement in the design.

>> But I don't see why we need bit-for-bit compatible output between SHA-1
>> and SHA-256 git repos for the reasons noted in the linked-to reply, and
>> removing this will remove a *really useful* aspect of our tar format,
>> which is that you can grab an arbitrary tarball, and see what commit
>> it's produced from.
>
> True, but this is a highly obscure feature and I've never used it
> outside of testing.

I admit that's a bit obscure, but one of those things that really comes
in handy when you need it, I vaguely remember using it once or twice and
being very happy it was there.

But related to that is setting everything to epoch:0, doesn't that mean
that when you unpack say a release archive that in common filesystem
browsers all of the files will be dated in the 70s, as opposed to the
time of release as it is now?
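
To make the arithmetic concrete, here is a small sketch of how an mtime
of 0 would be rendered, assuming the unpacking tool applies the archived
mtime and the file browser shows it in UTC. display_date() is a made-up
helper and the second timestamp is just an arbitrary 2023 value.

```python
from datetime import datetime, timezone


def display_date(mtime: int) -> str:
    """Render a Unix timestamp the way a file listing might, in UTC."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")


print(display_date(0))           # 1970-01-01
print(display_date(1675814400))  # 2023-02-08
```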

> If you want it, you can have it: you just want the
> default format, which serializes it in the header, and not the extremely
> restricted format I'm proposing here which is designed to never ever
> change.

OK, so I might have to take back much of what I said above, so you're
not opposed to supporting the current format as a "v1" or whatever,
you'd just like this proposed "v2" (or "vstable", or whatever) to have
some "blessed" status.

I just don't get why we wouldn't support both, if the delta is as small
as seems to be the case. If that's right this "v2" is less "extremely
restricted" compared to our current "v1", and more "almost identical",
just "a bit less wart-y".

> We might well decide to add cool new features and useful
> information to the default format, but this one will be fixed forever.

I just don't see the target audience for that. As the issues that
prompted these on-list discussions show we have people in the wild who
deeply care about the current format.

They probably care enough about that that we're likely to try to support
that forever, at least I don't see any currently proposed change to the
format that seems worth breaking things for those users.

So, if that's the case we'd have a v1 (current), this "vstable" (never
changes), and a v2 (v1 + extra neat things), etc.

Then we'd be maintaining 3 formats instead of 2 formats (a "v1" and
"vunstable").

>> Even if you want to retain SHA-1 and SHA-256 interop as far as tar is
>> concerned, an un-discussed alternative is to just stick the SHA-1 OID
>> into the SHA-256 archive.
>> 
>> For repos that are migrated we envision having such a bi-directional
>> mapping anyway.
>> 
>> And for those that started out as SHA-256, or where we no longer care
>> about compatibility with old SHA-1, we can just start including the
>> SHA-256 OID, as all compatibility concerns have gone away when we
>> stopped bothering to maintain the mapping, no?
>
> Whether SHA-1 or SHA-256 or both are present in the repo is a local
> decision.  The transition plan specifically anticipates people either
> preferring one hash or the other in output.  The behaviour is not "use
> SHA-1 if there's SHA-1 and use SHA-256 otherwise", because even if
> everyone has SHA-256 and prefers it on their system, some people may
> still have SHA-1 for historical reasons and that would lead to different
> output.

Yes, but who has this issue in practice? In practice people are
producing archives as part of some release process.

As long as they keep using SHA-1 to make releases they're fine, at some
point they'll switch over to SHA-256 by default, and then their new
releases will use SHA-256.

If they then have to for some reason go back to an old commit when SHA-1
was the default it might be a tiny hassle, but no more than doing the
same if the format had changed entirely.

> Part of this is because I anticipate that once the interop work is done,
> GitHub may transition repositories on the server to SHA-256 with SHA-1
> interop for existing SHA-1 repositories.  People are still going to have
> a fit if tarball data breaks at some point because the repository owner
> decided to flip the default hash algorithm, and I'm specifically
> proposing a format that is not going to direct hordes of angry users in
> my direction or the repository owner's in that case.  Lots of people are
> going to avoid switching the default hash algorithm if it breaks
> tarballs, and I specifically don't want to encourage people sticking
> with SHA-1 for that reason.

I see that, but I don't see how your plan isn't a perfect recipe for
creating the problem you're trying to avoid.

You have tarballs generated with the current format today, 3rd party
systems are dynamically downloading e.g. v1.0.0.tar.gz or whatever, and
expecting it to byte-for-byte match previous downloads.
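
A minimal sketch of what such a 3rd party check tends to look like,
assuming it pins a SHA-256 of the first download. matches_pin() and the
sample bytes are hypothetical, not taken from any particular system.

```python
import hashlib


def matches_pin(archive: bytes, pinned_sha256: str) -> bool:
    """True if the downloaded bytes hash to the digest recorded earlier."""
    return hashlib.sha256(archive).hexdigest() == pinned_sha256


first_download = b"stand-in for the bytes of v1.0.0.tar.gz"
pin = hashlib.sha256(first_download).hexdigest()

# Same tree, but imagine one header byte changed by a new archive format.
regenerated = b"stand-in for the bytes of v1.0.0.tar.gZ"

print(matches_pin(first_download, pin))  # True
print(matches_pin(regenerated, pin))     # False
```

Any change to the emitted bytes, however small, fails a check like this,
which is why the switch-over has to be a one-off event either way.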

If you're going to switch to some stable format surely that would either
need to involve massive one-off breakage, or you'd have some "flag day",
from today all new archives are produced with the new "stable" method.

If that "stable" format is different (among other things, but the others
seem equally trivial) because you wanted to extract the OID from the
format for SHA-1 and SHA-256 interop, why can't the day the repository
switched to SHA-256 be that flag day?