"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:

> +Overview
> +--------
> +
> +Many people find it convenient to have tar archives that are bit-for-bit
> +identical between versions. This can be valuable to validate that an archive
> +has not changed using a cryptographic hash without needing to store the archive
> +itself.
> +
> +However, up to now, Git has not guaranteed a consistent format, although people
> +often make the assumption that Git's archives will always be bit-for-bit
> +identical. This has led to several notable problems with various forges.
> +
> +This document proposes a canonical tar format based on the POSIX pax format that
> +is bit-for-bit identical. It is referred to as ctar-v1 (canonical tar version 1).

"is identical to what"? Ditto for the one in the previous paragraph.
The first paragraph is better in that there is "between versions",
even though it would be easier to grok if we made it clearer that we
are talking about versions of the software that is used to create
the archive, not versions of the contents being archived.

Our goal is that serializing the same tree object or the same commit
object gives a bit-for-bit identical result, no matter which version
of Git is used, and no matter what platform the Git used to create
the archive was built on. Mentioning in the description both what we
take an archive out of (i.e. a tree or a commit) and that different
versions of Git can be used to create archives would make it easier
to grok.

> +Goals and Rationale
> +-------------------
> +
> +The goals for this format are that it is first and foremost reproducible, that
> +identical trees produce identical results, that it is simple and easy to
> +implement correctly, and that it is useful in general.
> +While we don't consider
> +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> +sparse files), there is intense interest in reproducible builds, and so it makes
> +sense to design something that can see general use for software interchange.

Perfect.

> +Because the goal is strict reproducibility, this format doesn't honor
> +`tar.umask` or other options that can produce different output. It serializes
> +all timestamps as the Epoch, which produces identical results whether the tree
> +is serialized as a tree, commit, or tag. This is consistent with the behaviour
> +of some other tar serializers, including the default for modern Rust crates, and
> +is not believed to pose any interoperability problems.
> +Object IDs are not included in this version of the format because this produces
> +non-identical data when identical data is serialized with different hash
> +algorithms.

Declaring that we'll always peel a tag or a commit down to a tree is
one sure way to avoid having to worry about object name hashes, but
aren't we discarding too much utility by doing so? This is probably
debatable.

The commit object name embedded in the extended header of an archive
makes it trivial to identify what version the archive _claims_ to
have been taken from (you could also embed it in the name of the file
that stores the archive, but the embedded metainfo is more robust,
as it does not depend on the file name).

And running "git archive" twice, with different versions of Git on
different architectures, should be reproducible as long as both
invokers expressed their desire to see the commit object name in the
archive by passing the commit, not its tree, to the command, and
they are using the same hash algorithm.

In the world where multiple hash functions are in use, a commit that
is being archived may have one or two "object names", but it should
not be hard to use one extended header item for each to store one or
both, I would imagine.

Having said all that, I think stripping the commit object name (or
tags) is a better design. Imagine that I created a tarball earlier
and published its hash, but later lost the tarball. By not allowing
any commit object name in the archive, it would force me to somehow
name the tarball in such a way that I can tell which commit I used
to create it, e.g. "git-e83c516331.tar". Other people can notice the
filename and, without having seen the bytes in it, they can try
running "git archive e83c516331" in their repository and see that
the output matches the hash I published earlier. Having a commit or
tag embedded in the archive would make it harder to do this kind of
thing.

By the way, other potentially interesting points are:

 - Do we want to ignore "export-subst" for stability?

 - "git archive" can be invoked with pathspec to archive only a
   subset of paths.

 - "git archive" could be extended to include submodule trees
   recursively in the same output.

The latter two are trivial to support, but we need to make sure that
we do not screw up the ordering of paths in the output, especially
for the last one, when we add it.

> +Introduction to the Underlying Format
> +-------------------------------------
> ...
> +A global extended header sets metadata for the entire file, and a per-file
> +extended header applies to only the to which it corresponds. A per-file

"only the to which" -> "only the file to which"

> +extended header overrides any data specified in the global extended header, and
> +all extended headers override any data stored in a normal ustar per-file header
> +block.
> +While pax extensions are widely supported by most modern versions of tar
> +(including versions on Windows and all major open-source OSes), some older
> +archivers and non-tar implementations which do not understand them typically
> +extract the extended headers as regular files. Thus, it's helpful to have these
> +entries have reasonable permissions and unique names.

Surely, and to make things reproducible, they shouldn't just be
reasonable and unique. They should be exactly as we define in the
specification.

> +General Architecture
> +--------------------
> +
> +All canonical tar archives are valid POSIX pax archives as that format is
> +defined in POSIX.1-2017. Every archive will have a global header indicating the
> +version and format and what types of data are valid in the archive.
> +
> +Every file serialized in the archive is serialized in lexicographical order by
> +its bytes. A directory is always serialized before its contents, and a

"by its bytes" -> "by the bytes in its filename" or something?
Surely we do not sort by contents ;-)

> +directory is never serialized with a trailing slash. If a system uses a Unicode
> +encoding other than UTF-8, it encodes filenames as UTF-8.

This is a bit hard to grok. Do you mean that there may be a UTF-16
system where the paths in our tree objects are recorded in UTF-8,
but "git checkout" of the tree may result in files named in the
native encoding of that system, i.e. UTF-16, not UTF-8? And that
even on such a system, running "git archive" would record paths in
the archive in UTF-8 (i.e. the same as what was in the tree object)?

Or do you mean something stronger, e.g. on a Latin-1 system, with a
Latin-1 project that used Latin-1 pathnames even in the tree
objects, when "git archive" produces an archive, the paths in it
shall be transcoded from the original Latin-1 pathnames to UTF-8?

> +Each file shall contain a pax extended header record.
> +
> +It is possible to encode some extended headers in multiple ways because the
> +length in the header encodes its own length. For example, in cases where the
> +length value can be encoded as either 99 or 100, both can lead to identical
> +header data. The shortest possible encoding must always be used.

;-)

> +In any event where multiple encodings are possible, the shortest and, if there
> +is still confusion, lexicographically first (by byte value) must always be used.

;-)

> +All unspecified padding is filled with NUL bytes.

Perhaps we should also replace the casual mentions of "zero"s we saw
earlier with "NUL bytes".

> +Version Number
> +--------------
> +
> +The version number for this version is `ctar-v1`.
> +
> +Extended Headers
> +----------------
> +
> +Global Extended Header
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +The global extended header (record `g`) shall contain one header:
> +`CTAR.version`, which contains the version number specified above.
> +
> +The contents of the ustar header for the global extended header are as below,
> +except that the `name` field contains `pax_global_header`.

"as below" meaning...? The same as what is listed in "Per-File
Extended Header"? There is no `name` field listed there, though.

> +Per-File Extended Header
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Each file has a per-file extended header.
> +
> +The following per-file extended header fields are included:
> +
> +|===
> +| Field Name | When Present | Value
> +
> +| `atime` | always | `0`
> +| `mtime` | always | `0`
> +| `size` | always | size of the data in bytes
> +| `path` | always | full path name of the file

These are length-prefixed data, so we do not have to worry about
overly long pathnames or symlinks?

> +| `uid` | always | `0`
> +| `gid` | always | `0`
> +| `uname` | always | `root`
> +| `gname` | always | `root`
> +| `linkpath` | symbolic link | full path name of the link destination
> +| `hdrcharset` | binary path | `BINARY`
> +
> +Note that the `hdrcharset` entry appears if and only if the `path` or, if
> +present, the `linkpath`, header contains a non-UTF-8 encoded string.
> +Because
> +Git does not store the encoding of file names, it has no way of knowing whether
> +a file name which could be valid UTF-8 actually is, but for the purposes of
> +compatibility, such file names are assumed to be UTF-8 and are not declared as
> +binary. This improves portability to systems which always use Unicode.
> +However, we because we do not know for certain whether these values are UTF-8,

"we because" -> "because"

> +we avoid explicitly declaring them as such and rely on the default archiver
> +behavior, which may be more sensible.

So, do we or do we not store hdrcharset? The Git that produces the
archive does not know whether the pathnames stored in the tree it is
asked to archive are in UTF-8, so it assumes everything is in UTF-8
and hence does not see the need to add hdrcharset?

> +The `path` field contains the full path name without a leading slash or leading
> +`.` or `..` component. The path never contains a directory component which is
> +`.` or `..`.
> +
> +The `linkpath` field contains the full symbolic link destination. `.` and `..`
> +components are permitted if the destination contains those values.

In other words, we just store the contents of the blob that
represents the symbolic link there? I wonder if we do anything
special if a blob that is pointed at by a tree entry whose mode bits
are 120000 has a NUL in it (should we teach fsck to flag it, for
example)?

> +In all cases, path names use `/` as the directory separator.
> +
> +The reason for always including most of the entries in the archive is to aid in
> +implementing and testing correct serialization. If these entries are always
> +present, then this process becomes much simpler, whereas if they are only
> +included as needed, then errors are more likely.

The order of entries needs to be specified when we aim for
bit-for-bit reproducibility, no?

> +The `name` field of the ustar header of this extended header is `paxheader.%d`,
> +where `%d` represents the shortest-form decimal integer encoding the index of
> +this file in the archive, starting with 0. All files, directories, and links of
> +whatever kind are counted, but extended headers are not.
> +Serialization of Extended Headers
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When serializing the header block for an extended header, the following values

"the header block" -> "the ustar header block" to match the next
section, probably.

> +should be used. Note that all text fields are be NUL-padded on the right when
> +they do not fill the field, and all octal fields are left-padded with zeros such
> +that they fill the field with a single trailing NUL. An empty field contains
> +only NULs.
> +
> +|===
> +| Field Name | Value
> +
> +| `name` | `pax_global_header` (global) or `paxheader.%d` (per-file) (see above)
> +| `mode` | `0640`
> +| `uid` | `0`
> +| `gid` | `0`
> +| `size` | the size of the extended header in bytes
> +| `mtime` | `0` (the Epoch)
> +| `chksum` | as specified in the standard
> +| `typeflag` | `g` (global) or `x` (per-file)
> +| `linkname` | empty
> +| `magic` | as specified in the standard
> +| `version` | as specified in the standard
> +| `uname` | `root`
> +| `gname` | `root`
> +| `devmajor` | `0`
> +| `devminor` | `0`
> +| `prefix` | empty
> +|===

These are barebone ustar header fields, not extended headers. Do we
want to refer to some canonical source so that readers understand
that, unlike the extended headers, we are talking about fixed-length
fields here? The description above talks about "padding", but that
of course applies only to fixed-width fields.

> +When encoding the data for an extended header, all entries are sorted in order
> +by the byte values of their keys as encoded in UTF-8. Duplicate keys are not
> +permitted.
> +
> +Because the format allows multiple length encodings of some values, the shortest
> +possible encoding must always be used.
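
To make the "shortest length" rule concrete, here is a rough sketch
in Python of how one extended header record could be serialized. It
is not taken from the patch. It assumes the pax record layout of
"<length> <key>=<value>\n", where <length> counts the whole record
including its own digits. The function names `encode_pax_record` and
`encode_pax_header` are made up for illustration.

```python
def encode_pax_record(key: str, value: bytes) -> bytes:
    """Encode one pax record using the shortest self-consistent length."""
    # Everything in the record after the decimal length prefix.
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    # Try length prefixes of 1 digit, 2 digits, ... in increasing order,
    # so the first self-consistent candidate is also the shortest one.
    for digits in range(1, 21):
        total = len(body) + digits
        if len(str(total)) == digits:
            return str(total).encode("ascii") + body
    raise ValueError("record too large")


def encode_pax_header(records: dict[str, bytes]) -> bytes:
    """Concatenate records sorted by the UTF-8 byte values of their keys."""
    # A dict cannot hold duplicate keys, which matches the
    # "duplicate keys are not permitted" rule quoted above.
    keys = sorted(records, key=lambda k: k.encode("utf-8"))
    return b"".join(encode_pax_record(k, records[k]) for k in keys)
```

With a 97-byte body, both "99" and "100" would be self-consistent
prefixes, and the loop above settles on "99", which is the case the
quoted text describes. A nine-byte body such as " mtime=0\n" cannot
use a one-digit prefix (the total would be 10), so it becomes
"11 mtime=0\n".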