On Sat, May 25 2019, René Scharfe wrote:

> On 24.05.19 at 10:13, Jeff King wrote:
>> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>>
>>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
>>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>>
>>> bsdtar 2.8.3 - libarchive 2.8.3
>>
>> Interesting. I wonder if there was a libarchive bug that was fixed
>> between 2.8.3 and 3.3.3.
>>
>>>> Git does write a pax header with the commit id in it as a comment.
>>>> Presumably that's what it's complaining about (but it is not malformed
>>>> according to any tar I've tried). If you feed git-archive a tree rather
>>>> than a commit, that is omitted. What does:
>>>>
>>>>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>>
>>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>>> field.
>>>
>>> It also complains
>>>
>>>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>   tar: Ignoring malformed pax extended attribute
>>>   tar: Error exit delayed from previous errors.
>>
>> Ah, OK. So it's not the comment field at all, but some other entry.
>>
>>> Some more context: I work at Sourcegraph.com. We mirror a lot of repos
>>> from github.com. We usually interact with a working copy by running
>>> git archive on it in our infrastructure. This is the first repository
>>> that I have noticed which produces this error. An interesting thing to
>>> note is the commit metadata contains a lot of non-ascii text which was
>>> my guess at what may be tripping up the tar creation.
>>
>> Yeah, though the only thing that makes it into the tarfile is the actual
>> tree entries. I'd imagine the file content is not likely to be a source
>> of problems, as it's common to see binary gunk there. Most of the
>> filenames are pretty mundane, but this symlink destination is a little
>> funny:
>>
>>   $ git archive ... | tar tvf - | grep nicovideo4as.swc
>>   lrwxrwxrwx root/root 0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>>
>> That's not the full story, though. It is indeed a symlink in the
>> tree:
>>
>>   $ git ls-tree -r HEAD libs/nicovideo4as.swc
>>   120000 blob ec3137b5fcaeae25cf67927068af116517683806    libs/nicovideo4as.swc
>>
>> But the contents of that blob, which should be the destination filename,
>> are definitely not:
>>
>>   $ git cat-file blob ec3137b5f | wc -c
>>   57804
>>   $ git cat-file blob ec3137b5f | xxd | head -1
>>   00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..
>>
>> There's quite a bit more data there. And what tar showed us goes up to
>> the first NUL, which does not seem surprising.
>
> That (the symlink target) is a ZIP file with the following contents:
>
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
>    54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
> --------          -------  ---                            -------
>    93864            57554  39%                            2 files
>
> And link targets longer than 100 characters are encoded in an extended
> Pax header.
>
> (Usually symlink targets are paths, not file contents.)
>
>> It's possible Git is doing the wrong thing on the writing side, but
>> given that newer versions of bsdtar handle it fine, I'd guess that the
>> old one simply had problems consuming poorly formed symlink filenames.
>
> Git preserves symlink targets with embedded NULs in the repository and
> in generated tar files. Not sure if GNU tar and bsdtar truncating them
> at the first NUL is a bug. I'm also not sure if there is a platform
> that would allow creating such a symlink in the file system, or how one
> is supposed to use it.
>
> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all. We could
> make git fsck report such symlinks.
>
> Can Unicode symlink targets contain NULs? We wouldn't want to damage
> them even if we decide to truncate.

I don't see a practical use for this case, and maybe we should even have
fsck check whether the blob representing a symlink target contains a \0,
as suggested upthread.

That being said, the assumption that data in a tar archive will get
written to a filesystem of some sort doesn't always hold. There are
plenty of consumers of the format that read it in memory and stream its
contents out to something else entirely, e.g. taking "git archive
--remote" output, parsing it with something like [1], and throwing some
or all of the content into a database.

1. https://metacpan.org/pod/Archive::Tar
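
To make the in-memory consumer case concrete, here is a minimal sketch in
Python, using the standard-library tarfile module rather than Archive::Tar.
The sample archive is built in memory so the example is self-contained; in
practice the stream would come from something like "git archive --remote".
The helper names (index_tar_stream, build_sample_archive), the sample paths,
and the dict standing in for a database are all made up for illustration.

```python
import io
import tarfile


def index_tar_stream(stream):
    """Read a tar stream without writing anything to the filesystem.

    Returns a dict mapping each member name to ("file", content bytes)
    or ("symlink", target string). The dict stands in for a database.
    """
    index = {}
    # Mode "r|" reads the archive sequentially, as if from a pipe.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if member.issym():
                # The link target is just a string here; no symlink is created.
                index[member.name] = ("symlink", member.linkname)
            elif member.isfile():
                index[member.name] = ("file", archive.extractfile(member).read())
    return index


def build_sample_archive():
    """Build a small pax-format tar archive in memory (hypothetical data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

        link = tarfile.TarInfo("libs/example.swc")
        link.type = tarfile.SYMTYPE
        link.linkname = "../vendor/example.swc"
        archive.addfile(link)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, (kind, value) in index_tar_stream(build_sample_archive()).items():
        print(name, kind, value)
```

A consumer like this never asks the operating system to create the symlink,
so the link target is only ever handled as data read from the archive.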