Re: [RFC PATCH 1/1] Document a fixed tar format for interoperability

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Tue, 7 Feb 2023 22:34:14 +0000

On 2023-02-06 at 21:08:59, Junio C Hamano wrote:
> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
> "is identical to what"?  Ditto for the one in the previous
> paragraph.  The first paragraph is better in that there is "between
> versions", even though it would be easier to grok if we made it more
> clear that we are talking about versions of the software that is
> used to create the archive, not the version of contents being
> archived.
> 
> Our goal is that serializing the same tree object or the same commit
> object results in a bit-for-bit identical result, no matter which
> version of Git is used, and no matter what platform the Git used to
> create the archive was built on.  Mentioning both what we take an
> archive out of (i.e. tree or commit) and that we can use different
> versions of Git to create archives, in the description would make it
> easier to grok.

I can update that to reflect things more accurately.

> > +Goals and Rationale
> > +-------------------
> > +
> > +The goals for this format are that it is first and foremost reproducible, that
> > +identical trees produce identical results, that it is simple and easy to
> > +implement correctly, and that it is useful in general.  While we don't consider
> > +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> > +sparse files), there is intense interest in reproducible builds, and so it makes
> > +sense to design something that can see general use for software interchange.
> 
> Perfect.
> 
> > +Because the goal is strict reproducibility, this format doesn't honor
> > +`tar.umask` or other options that can produce different output.  It serializes
> > +all timestamps as the Epoch, which produces identical results whether the tree
> > +is serialized as a tree, commit, or tag.  This is consistent with the behaviour
> > +of some other tar serializers, including the default for modern Rust crates, and
> > +is not believed to pose any interoperability problems.
> 
> > +Object IDs are not included in this version of the format because this produces
> > +non-identical data when identical data is serialized with different hash
> > +algorithms.
> 
> Declaring that we'll always peel a tag or a commit down to a tree is
> one sure way to avoid having to worry about object name hashes, but
> aren't we discarding too much utility by doing so?
> 
> This is probably debatable.  The commit object name embedded in the
> extended header of an archive makes it trivial to identify what
> version the archive _claims_ to have been taken from (you could also
> embed it in the filename that stores archive, but the use of the
> embedded metainfo makes it more robust against file names).  And
> running "git archive" twice, with different versions of Git on
> different architectures, should be reproducible as long as both
> invokers expressed their desire to see the commit object name in the
> archive by passing the commit, not its tree, to the command, and
> they are using the same hash algorithm.

It's true that it makes it easy to look up, but I can say I've never
used that functionality.  I think very few people actually know it
exists.

> Having said all that, I think stripping the commit object name (or
> tags) is a better design.  Imagine that I see I created a tarball
> earlier and published its hash, but later lost the tarball.  By not
> allowing any commit object name in the archive, it would force me to
> somehow name the tarball in such a way that I can tell which commit
> I used to create it, e.g. "git-e83c516331.tar".  Other people can
> notice the filename and without having seen the bytes in it, they
> can try running "git archive e83c516331" in their repository and see
> the output matches the hash I published earlier.  Having commit or
> tag embedded in the archive would make it harder to do this kind of
> thing.

Most people do this anyway (except with a tag name), so I don't think
it's a big deal to have this as the primary mechanism.

> By the way, other potentially interesting points are:
> 
>  - Do we want to ignore "export-subst" for stability?

I think that would be a good idea.  I'll add it in v2.

>  - "git archive" can be invoked with pathspec to archive only a
>    subset of paths.

True.  I don't think that's a problem as long as we generate paths
correctly.  I'll be sure to add tests for it, though.

> > +Introduction to the Underlying Format
> > +-------------------------------------
> > ...
> > +A global extended header sets metadata for the entire file, and a per-file
> > +extended header applies to only the to which it corresponds.  A per-file
> 
> "only the to which" -> "only the file to which"

Will fix.

> > +While pax extensions are widely supported by most modern versions of tar
> > +(including versions on Windows and all major open-source OSes), some older
> > +archivers and non-tar implementations which do not understand them typically
> > +extract the extended headers as regular files.  Thus, it's helpful to have these
> > +entries have reasonable permissions and unique names.
> 
> Surely, and to make things reproducible, they shouldn't just be
> reasonable and unique.  They should be exactly as we define in the
> specification.

Yes, of course.  This is more to indicate why we've made the decisions
to name them as they are and give them the permissions we did.

> > +Every file serialized in the archive is serialized in lexicographical order by
> > +its bytes.  A directory is always serialized before its contents, and a
> 
> "by its bytes" -> "by the bytes in its filename" or something?
> Surely we do not sort by contents ;-)

Good point.  We should avoid ambiguity.
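
As a rough sketch of one reading of that rule (assuming the sort key is
the full path with any trailing slash removed; the function name
`archive_order` is hypothetical and not part of the patch):

```python
def archive_order(paths):
    """Order archive entries by the bytes of their path names.

    Assumption: directories are named without a trailing slash, so a
    directory is a strict prefix of everything inside it and therefore
    sorts before its contents.
    """
    normalized = [p.rstrip(b"/") for p in paths]
    # Python compares bytes objects by unsigned byte value.
    return sorted(normalized)
```

Under this reading, `a.txt` sorts between the directory `a` and the
entry `a/b`, because `.` (0x2e) is less than `/` (0x2f).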

> > +directory is never serialized with a trailing slash.  If a system uses a Unicode
> > +encoding other than UTF-8, it encodes filenames as UTF-8.
> 
> This is a bit hard to grok.  Do you mean there may be UTF-16 system
> where the data in our tree objects, whose paths are recorded in UTF-8,
> but "git checkout" of the tree may result in files in the native
> filename on that system, i.e. UTF-16 not UTF-8?  And even on such a
> system, running "git archive" would record paths in the archive in
> UTF-8 (i.e. the same as what was in the tree object)?  Or do you
> mean something stronger, like on a Latin-1 system with Latin-1
> project that used Latin-1 as pathnames even in the tree objects,
> when "git archive" produces an archive, the paths in it shall be
> transcoded from the original Latin-1 pathnames to UTF-8?

This means if, on Windows, someone uses --add-file or
--add-virtual-file, those paths will be encoded in UTF-8, not UTF-16.

> > +Version Number
> > +--------------
> > +
> > +The version number for this version is `ctar-v1`.
> > +
> > +Extended Headers
> > +----------------
> > +
> > +Global Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The global extended header (record `g`) shall contain one header:
> > +`CTAR.version`, which contains the version number specified above.
> > +
> > +The contents of the ustar header for the global extended header are as below,
> > +except that the `name` field contains `pax_global_header`.
> 
> "as below" meaning...?  The same as what is listed in "Per-File
> Extended Header"?  There is no `name` field listed there, though.

I'll make a clearer reference.

> > +Per-File Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Each file has a per-file extended header.
> > +
> > +The following per-file extended header fields are included:
> > +
> > +|===
> > +| Field Name   | When Present  | Value
> > +
> > +| `atime`      | always        | `0`
> > +| `mtime`      | always        | `0`
> > +| `size`       | always        | size of the data in bytes
> > +| `path`       | always        | full path name of the file
> 
> These are length-prefixed data, so we do not have to worry about
> overly long pathnames or symlinks?

Correct.  This data can be arbitrarily long as long as all the metadata
can be encoded in a ustar header, so the limit is on the order of several
gigabytes.  I don't think anybody considers that a practical limitation
on filenames or other metadata.
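
For reference, a pax extended header record has the form
"<len> <key>=<value>\n", where <len> is the decimal length of the whole
record including its own digits.  A minimal sketch of that encoding
(the helper name `pax_record` is made up for illustration):

```python
def pax_record(key: str, value: bytes) -> bytes:
    """Encode one pax record as b"<len> <key>=<value>\\n"."""
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    # The length prefix counts its own digits, so iterate until the
    # guessed length matches the actual total.
    length = len(body)
    while len(body) + len(str(length)) != length:
        length = len(body) + len(str(length))
    return str(length).encode("ascii") + body
```

Since the value is delimited by the length prefix rather than by a
fixed-width field, a long path or symlink target needs no special
handling here.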

> "we because" -> "because"

Will fix.

> > +we avoid explicitly declaring them as such and rely on the default archiver
> > +behavior, which may be more sensible.
> 
> So, do we or do we not store hdrcharset?  Producing Git does not know
> if the pathnames stored in the tree it is asked to produce archive
> for are not in UTF-8, so it assumes everything is in UTF-8 hence
> does not see the need to add hdrcharset?

pax says that these values are UTF-8 if not specified.  If they're
clearly not UTF-8, we use `hdrcharset` and say they're binary.  If they
look like valid UTF-8, we don't use `hdrcharset` and pretend they are in
fact UTF-8, in case somebody just likes causing discord by using
Windows-1252 that looks like UTF-8.
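
A sketch of that decision, assuming a strict UTF-8 decode is the
validity check (the function names are hypothetical; `BINARY` is the
value pax defines for `hdrcharset`):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def charset_fields(path: bytes) -> dict:
    """Return the extra extended header fields needed for this path."""
    if looks_like_utf8(path):
        # Rely on the pax default (UTF-8); emit no hdrcharset record.
        return {}
    return {"hdrcharset": b"BINARY"}
```

With this, the Latin-1 bytes `caf\xe9` get `hdrcharset=BINARY`, while
the bytes `\xc3\xa9` are treated as UTF-8 even if the author meant them
as Windows-1252.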

> In other words, we just store the contents of the blob that
> represents the symbolic link there?  I wonder if we do anything
> special if a blob, that is pointed at in an entry in a tree whose
> mode bits are 120000, has NUL in it (should we teach fsck to flag
> it, for example)?

This is the destination of the symlink, yes.  We can simply check for
NUL and abort; I don't think that's an unreasonable behaviour in any
case.
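
Something along these lines, as a sketch only (the function name is
hypothetical, and raising an exception stands in for aborting):

```python
def symlink_target(blob: bytes) -> bytes:
    """Return the symlink destination stored in a blob, rejecting NUL."""
    if b"\0" in blob:
        raise ValueError("symlink target contains a NUL byte")
    return blob
```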

> The order of entries needs to be specified when we aim for
> bit-for-bit reproducibility, no?

Yes.  That's specified in the next section, where we say this:

  When encoding the data for an extended header, all entries are sorted in order
  by the byte values of their keys as encoded in UTF-8.  Duplicate keys are not
  permitted.

I'll make a reference to that section and describe it more clearly.
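
To illustrate the rule quoted above, here is a minimal sketch that sorts
records by the UTF-8 bytes of their keys and rejects duplicates (both
function names are made up for this example):

```python
def encode_record(key: bytes, value: bytes) -> bytes:
    # One pax record is "<len> <key>=<value>\n"; <len> counts itself.
    body = b" " + key + b"=" + value + b"\n"
    length = len(body)
    while len(body) + len(str(length)) != length:
        length = len(body) + len(str(length))
    return str(length).encode("ascii") + body


def encode_extended_header(fields):
    """fields: iterable of (key, value) pairs; key is str, value is bytes."""
    encoded = [(k.encode("utf-8"), v) for k, v in fields]
    keys = [k for k, _ in encoded]
    if len(set(keys)) != len(keys):
        raise ValueError("duplicate key in extended header")
    # Sort by the byte values of the UTF-8 encoded keys.
    ordered = sorted(encoded, key=lambda kv: kv[0])
    return b"".join(encode_record(k, v) for k, v in ordered)
```

The input order does not matter: the same set of fields always yields
the same bytes.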

> "the header block" -> "the ustar header block" to match the next
> section, probably.

I'll update that.

> These are barebone header fields, not extended headers.  Do we want
> to refer to some canonical sources so that readers understand that
> unlike the extended headers we are talking about fixed-length fields?
> The description above talks about "padding", but that of course
> applies to fixed width columns.

Correct.  I'll mention that these are the values in the ustar header for
the extended header.  I'll also put some references in to the
documentation.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA