On 1/31/23 4:54 AM, brian m. carlson wrote:
> Part of the reason I think this is valuable is that once SHA-1 and
> SHA-256 interoperability is present, git archive will change the
> contents of the archive format, since it will embed a SHA-256 hash into
> the file instead of a SHA-1 hash, since that's what's in the repository.
> Thus, we can't produce an archive that's deterministic in the face of
> SHA-1/SHA-256 interoperability concerns, and we need to create a new
> format that doesn't contain that data embedded in it.

I assume that whatever the reason was for originally embedding the OID
into the file, it still applies even if a new PAX format is established
for the use of git-archive. It may not be a great reason -- I don't
know. Perhaps there's an argument for removing it. But can't that be
done irrespective of standardizing the PAX format?

...

I'm not deeply knowledgeable about the SHA-256 transition work -- or
knowledgeable at all about it, frankly. (Also, my understanding was that
it had stalled, as discussed in https://lwn.net/Articles/898522/ -- I
understand that you're still enthusiastic about the work? But that
doesn't really answer "is there a timeframe for that to ever happen".)

But I sort of assumed that the transition work would already have to
embed a fair bit of information into the repository about the whole
process? Would it not be possible to determine whether a given tag
started life as SHA-1 or SHA-256? Maybe even just record a date when the
repository was converted to work with both, and embed the OID based on
whether the tag is tagging contents that were created after that
conversion? It seems to me like the problem should be solvable if people
want to solve it.

...

git-archive run on a commit obviously doesn't have this problem -- it
can simply embed the OID for the same argument it was called with. But I
assume it's far more common to access tag-based github endpoints. :D

> Having said that, I don't think this should be based on the timestamp of
> the file, since that means that two otherwise identical archives
> differing in timestamp aren't ever going to be the same, and we do see
> people who import or vendor other projects.

The timestamp of the output file? Surely not. But I only suggested the
timestamp of the commit/tag metadata that git-archive is asked to
produce output for. And we would need that in order to solve the problem
that reproducible github API archive endpoints pose.

I'm not sure what the "import or vendor other projects" angle here
means. Do you mean people who copy a directory of files into their
project? Who expects this to be the same to begin with? And doesn't
embedding the OID kill this idea, since the entire point of git commit
sha's is that you shouldn't be able (it should be prohibitively
unrealistic) to produce the same one twice in different contexts?

I have never said to myself "ah yes, I really would like to be able to
download a git auto-generated tarball for project A, and compare its
hash to the tarball for project B, and have them compare identical even
though they are different projects with different commits". IMHO this
isn't an interesting problem to solve -- the interesting problem to
solve is that a single absolute URL to a downloadable file should be
able to offer documented guarantees that it will always be the same
file, even though it is generated on the fly.
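(For concreteness, the "embedded OID" here is the pax global extended
header comment that git archive writes into the tar stream today; as far
as I can tell it is the only repository-hash-dependent data in the
format, and git ships a helper to read it back out. A minimal check, in
any checkout:

    git archive --format=tar HEAD | git get-tar-commit-id

That comment is exactly the value that would flip from SHA-1 to SHA-256
under the interoperability work.)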
> Nor do I think we should
> attempt to provide consistent compression, since I believe the output of
> things like zlib has changed in the past, and we can't continually carry
> an old, potentially insecure version of zlib just because the output
> changed. People should be able to implement compression using gzip,
> zlib, pigz, miniz_oxide, or whatever if they want, since people
> implement Git in many different languages, and we won't want to force
> people using memory-safe languages like Go and Rust to explicitly use
> zlib for archives.

I do not think it is realistic or reasonable for people to implement
compression using intentionally incompatible replacements for gzip and
expect interoperability of any sort. I also don't think people *have* to
implement compression in rust using zlib, but if they are going to make
a git-alike that produces archives, it would be worth it for them to
write whatever memory-safe rust is necessary to memory-safely produce
the same output stream of bytes. It's no less feasible than making sure
that busybox gzip and GNU gzip produce the same output, surely.

Alternatively, they could just not bother with gzip at all, and make
their git-alike produce zstd-compressed tarballs, which change their
byte outputs every time a new zstd release is published. :D Again, why
limit yourself to gzip if you want to be innovative anyway?

> That may mean that it's important for people to actually decompress the
> archive before checking hashes if they want deterministic behaviour, and
> I'm okay with that. You already have to do that if you're verifying the
> signature on Git tarballs, since only the uncompressed tar archive is
> signed, so I don't think this is out of the question.

This is a very kernel.org-centric view of things, I think. I have rarely
seen PGP signatures applied to the uncompressed tar except in that
context. The vast majority of tarballs with signatures sign a single
compressed tarball and don't concern themselves with, say, providing a
rotating, backdated, changeable list of compression formats with a
single signature covering all of them.

Nevertheless, in order to handle kernel.org-style tarballs, you are
entirely correct that one should be able to handle this.

From experience, I can say that this needs to be selected on a
per-tarball basis. Since signature files have filenames, we can match
their stems: given foo.tar.asc and foo.tar.gz, check the signature of
the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and
foo.tar.gz, simply check the signature of the original foo.tar.gz (see
the rough sketch at the end of this mail).

This doesn't really work for checksums, because you need to settle on
one or the other everywhere, or else embed decompression information
into your checksum metadata field. And for tarballs that are generated
once and uploaded to ftp storage, not repeatedly generated on the fly,
we know the checksum will never legitimately change, so we *want* to
hash the compressed file.

Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
At least one can verify the checksums first without decompression, which
is virtually guaranteed to catch invalid source code releases, so if you
ever progress to the PGP verification stage it's unlikely to be wasted
effort -- that tarball is definitely getting used to build something.
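To make the stem-matching idea above concrete, here is a rough sketch of
the selection logic (the function name and layout are just illustration,
not a proposal for any particular tool):

    verify_tarball() {
        # a signature named for the *uncompressed* tar covers the
        # decompressed stream; one named for the .tar.gz covers the
        # compressed file as-is
        sig=$1 tarball=$2
        case $sig in
            *.tar.asc)    gzip -dc < "$tarball" | gpg --verify "$sig" - ;;
            *.tar.gz.asc) gpg --verify "$sig" "$tarball" ;;
        esac
    }

    verify_tarball foo.tar.asc foo.tar.gz      # kernel.org style
    verify_tarball foo.tar.gz.asc foo.tar.gz   # the more common style

--
Eli Schwartz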