Re: git archive should use vendor extension in pax header

René Scharfe <l.s.r@xxxxxx> · Mon, 15 Feb 2016 21:25:40 +0100

On 06.02.2016 at 15:57, fuz@xxxxxx wrote:
On Sat, Feb 06, 2016 at 02:23:11PM +0100, René Scharfe wrote:
On 28.01.2016 at 00:45, fuz@xxxxxx wrote:
There is git get-tar-commit-id, which prints the commit ID if it
finds a comment entry which looks like a hexadecimal SHA-1 hash.
It's better than a hex editor at least. :)
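For illustration, a rough Python equivalent of that lookup, assuming the
commit ID sits in the "comment" keyword of the archive's pax global header.
It uses Python's tarfile module rather than git itself, and the helper names
read_commit_id and make_archive are made up for this sketch.

```python
import io
import re
import tarfile

HEX_SHA1 = re.compile(r"[0-9a-f]{40}")

def read_commit_id(fileobj):
    """Return the pax global "comment" value if it looks like a SHA-1, else None."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tf:
        # tarfile collects keywords from global (type g) headers here.
        comment = tf.pax_headers.get("comment", "")
    return comment if HEX_SHA1.fullmatch(comment) else None

def make_archive(comment):
    """Build a small in-memory pax archive with a global comment (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": comment}) as tf:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

Note that the check accepts any comment consisting of 40 hex digits, no
matter which program wrote it.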

This is incredibly fuzzy and can go wrong for a plethora of reasons.
I hope you agree though that the situation is suboptimal: git is doing
the equivalent of using a custom file format without an easily
recognizable magic number.

It is fuzzy in theory. But which other programs allow writing a
comment header?  I'm not aware of any, but I have to admit that I
didn't look too hard.

Well, what happens if the Mercurial folks were to implement
the same thing? Suddenly there is a conflict. Yes, of course, right now
there might be no program that uses the comment field for its own
purpose, but such design decisions tend not to be future-proof. There is
a very good reason why file formats typically have magic numbers and
don't just rely on people knowing that the file has a certain type, and
that is the same reason why git should mark its metadata in a unique
fashion.

Chances are good that Mercurial would do it in a way that doesn't 
conflict with git's tar comments.  I get your point, though, and agree 
that it's not ideal.  However, so far it's just a potential problem.

But I'm still interested in how you got a collection of tar files with
unknown origin.  Just curious.

Easy: Just download the (source) distribution archives of a distribution
of choice and try to verify that the tarballs they use to compile their
packages actually come from the project's public git repositories.

OK, that's easier than calculating checksums and comparing them with
those published by the respective projects, but also less
trustworthy.

If you have a known trusted archive, you could use it directly, no need
for cross-verification. The intent is to be able to check if archives
generated by someone from some sources could have plausibly been
generated from these sources.

It's probably not too important, but I think I still don't fully 
understand.  So you have a tar file of unknown origin.  You hand it to 
git get-tar-commit-id or a similar tool and get back 
a08595f76159b09d57553e37a5123f1091bb13e7.  You can google this string 
and find out it's the commit ID for git v2.7.1.

Your tar file could have been modified in various ways, though, e.g. 
with tar u or tar --delete.  So you try to find a download site for the 
software that includes file hashes for archives of this release, like in
https://www.kernel.org/pub/software/scm/git/sha256sums.asc.

If the published hash and a hash of your file match then you can be 
reasonably sure the files are the same.  If they don't then it could be 
due to variations added by the compressor.  You can download the 
authoritative archive and compare it with yours.
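The hash comparison step could look roughly like this in Python. It is only
a sketch: the function names are made up, and it assumes you have already
copied the published SHA-256 hex string for the exact file (compressed or
not) that you hold.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path, published_hex):
    """Compare a local file against a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == published_hex.strip().lower()
```

A mismatch only says the bytes differ; it doesn't tell you whether the
difference comes from the compressor or from the contents.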

Is that how it goes?

I'm very interested in hearing about any git specific bugs.

I don't know any. Bugs tend to become known only after 1000s of buggy
archives have been published (just as with some GNU tar bugs). It's
great to have a way to detect that the archive might be affected by
a bug so you know that you need to work around it.

That requires a field containing the git version which was used to 
create the archive, no?

After thinking about the problem a bit more and discussing it with the
aforementioned Jörg Schilling, we came to the conclusion that the best
way to deal with a “file omitted” attribute is to attach it to the
directory that would normally contain the omitted file.

Sounds sensible, but the ordering can be a bit tricky.  If d/a is 
included and d/b is not then it would be easy to write d/, d/a and the 
extended header that says that d/b is excluded, in that order.  Writing 
the extended header first is a bit harder and I'm not sure if it's 
needed.  And it gets tricky if more than one entry is excluded per 
directory. (Just thinking out loud here.)

Letting archivers extract metadata as regular files is annoying to
those that are not interested in it.  Extended headers themselves
(type g) are bad enough already in this regard for those stuck with
old tar versions.

I think we can safely assume that systems support pax headers 15 years
after they have been standardized. I was actually unable to find a
non-historical version of a serious archiver that claims to support tar
archives but doesn't support pax headers.

Well, that depends on your definition of "serious".  Plan 9's tar 
perhaps doesn't fit it, but what about 7-Zip (http://www.7-zip.org/)?

And there is no way (or did I overlook it?) to modify or display the 
comment extended header using GNU tar.  That's actually surprising to 
me: I'd think the ability to add a human-readable description to a 
backup on tape is quite important.  (But I didn't touch an actual tape 
for quite a while, and I never used tar directly with them.)

The GIT.path option holds the paths that are being archived. It is a bit
tricky to get right.  The intent of POSIX pax headers is that each key
is an attribute that applies to a series of files.  In the case of a
global header, each key applies until it is overridden with a new
header or with a local header.  A GIT.path key should only apply to the
files that correspond to this path operand to git archive.  Thus, a new
GIT.path should be written frequently.  There should always be at least
one GIT.path.

That's for the optional path parameters of git archive, right?  A
list of included paths (GIT.include) would be simpler and should
suffice, no?

No.  Again: An attribute in a pax header pertains to a file.  It's metadata
attached to a file, not metadata attached to the whole archive, even when
part of a global header.  Thus each file should carry the path operand
it came from.  Which other path operands git received is not an attribute
of a file; only the path operand that caused the inclusion of that one
file is.

Not an issue; we can make our own rules for our own keywords.

Well, yes, but they should still stick to the semantic concept POSIX
imposes for extended headers: headers pertain to files and the only
difference between a g header and an x header is that the former applies
until it is revoked by a new g header or overridden by an x header.
Not sticking to this concept can lead to weird problems with programs
that modify tar archives (like GNU tar) and is not future-proof. Better
stick to the standard.

It's easy enough, I think: For each archive entry check if it is 
explicitly mentioned in the list of paths to archive and write an 
extended header with GIT.path before proceeding as usual, no?  And for 
the common case without path specification (meaning all files are 
included) no such header would be needed.
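To make that concrete, here is a sketch using Python's tarfile module, not
git's own archiver.  GIT.path is only the keyword proposed in this thread,
nothing git writes today, and the helper names are made up.  The sketch
assumes an entry matches an operand if it is named exactly or lies below a
directory operand, and that the caller has already filtered the file list.

```python
import io
import tarfile

def operand_for(name, operands):
    """Return the path operand that caused `name` to be included, or None."""
    for op in operands:
        prefix = op.rstrip("/")
        if name == prefix or name.startswith(prefix + "/"):
            return op
    return None

def write_archive(files, operands):
    """files maps entry name -> bytes; returns a BytesIO holding a pax tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            op = operand_for(name, operands)
            if op is not None:
                # Written as a per-entry extended (type x) header.
                info.pax_headers = {"GIT.path": op}
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

With an empty operand list no entry gets an extended header, which matches
the common case described above.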

René