On Thu, Feb 02 2023, rsbecker@xxxxxxxxxxxxx wrote:

> On February 2, 2023 6:02 PM, brian m. carlson wrote:
>> On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
>>> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
>>>
>>> > I don't think a blurb is necessary, but you're basically
>>> > underscoring the problem, which is that nobody is willing to
>>> > promise that compression is consistent, yet people want to rely
>>> > on that fact. I'm willing to write and implement a consistent
>>> > tar spec and to guarantee compatibility with that, but the
>>> > tension here is that people also want gzip to never change its
>>> > byte format ever, which frankly seems unrealistic without
>>> > explicit guarantees. Maybe the authors will agree to promise
>>> > that, but it seems unlikely.
>>>
>>> Just to step back a bit, where does the distinction between
>>> guaranteeing the tar format stability and gzip compressed bitstream
>>> stability come from? At both levels, the same thing can be
>>> expressed in multiple different ways, I think, but spelling out how
>>> exactly the compressor compresses is more involved than spelling
>>> out how entries in a tar archive are ordered and each entry is
>>> expressed, or something?
>>
>> Yes, at least with my understanding of how gzip and compression in
>> general work.
>>
>> The tar format (and the pax format which builds on it) can mostly be
>> restricted by explaining what data is to be included in the pax and
>> tar headers and how it is to be formatted. If we say we will always
>> write such and such information in the pax header and sort the keys,
>> and we write such and such information in the tar header, then the
>> format is completely deterministic, and we can make nice guarantees.
>>
>> My understanding of how Lempel-Ziv-based compression algorithms work
>> is that there's a lot more freedom to decide how best to compress
>> things and that there isn't always an obvious logical choice, but I
>> will admit my understanding is relatively limited. If someone thinks
>> we can effectively succeed in supporting compression beyond just
>> relying on gzip, I would be delighted to be shown to be wrong.
>
> The nice part about gzip is that it is generally available on
> virtually all platforms (or can be easily obtained). Other
> compression forms, like bz2, which sometimes produce denser
> compression, are not necessarily available. Availability is something
> I would be worried about...

I agree with all of that; gzip is in such wide use for a reason.

> ... (clone and checkout failures).

But how would a hypothetical obscure format for "git archive"
contribute to clone or checkout failures? Are you thinking of our use
of zlib for e.g. loose objects? That's unrelated to this discussion
(and I don't think anyone relies on their compressed checksum).

> Tar formats are also to be used carefully. Not all platform
> implementations of tar support all variants. "ustar" is fairly common
> but there are others that are not. Interoperability needs to be the
> biggest factor in this decision, IMHO, rather than compression rates.

For "git archive", whether you care about interoperability depends on
the target audience of your archive, and in any case I don't see why
we need to worry about it, except perhaps to note that some formats
are more portable than others if we e.g. had a built-in "tar.bz2"
helper method.

> The alternative is having git supply its own implementation, but that
> is a longer term migration problem, resembling the SHA-256 migration.
I've noted elsewhere in this thread that I don't see the point of
shipping a fallback "gzip" beyond the "git archive" gzip we have
already, but even if we did, the scope of that seems pretty small, and
the work involved *much* easier than the SHA-256 migration.
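
To make brian's two points above concrete, here is a rough sketch of
the first one, in Python for brevity (this is not Git's archive code;
"deterministic_tar" and its fixed-metadata choices are made up for
illustration). Once the entry order and every header field are pinned
down, the tar bytes are a pure function of the inputs:

    import io
    import tarfile

    def deterministic_tar(files):
        """files: dict of path -> bytes; returns tar bytes that
        depend only on the file names and contents."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w",
                          format=tarfile.USTAR_FORMAT) as tar:
            for path in sorted(files):         # fixed entry order
                data = files[path]
                info = tarfile.TarInfo(name=path)
                info.size = len(data)
                info.mtime = 0                 # fixed timestamp
                info.uid = info.gid = 0        # fixed ownership
                info.uname = info.gname = "root"
                info.mode = 0o644              # fixed permissions
                tar.addfile(info, io.BytesIO(data))
        return buf.getvalue()

    a = deterministic_tar({"b.txt": b"hello", "a.txt": b"world"})
    b = deterministic_tar({"a.txt": b"world", "b.txt": b"hello"})
    assert a == b  # same inputs, identical bytes

As far as I know "git archive" already behaves like this in spirit (it
pins the mtime to the commit date and records the commit ID in a pax
comment header), so the spec brian describes would mostly be writing
down choices that have already been made.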
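
The second point is the flip side: the compressor has freedom the tar
writer doesn't. The same input compresses to different byte streams
depending on the level (and strategy, and zlib version), and the gzip
wrapper additionally embeds a timestamp in its header, which is why
"gzip -n" exists. Every stream round-trips to the same data, but none
of the compressed bytes are promised to be stable. A toy demonstration
(the input is arbitrary, it's only there to give the compressor
choices to make):

    import gzip
    import time
    import zlib

    data = b"a tar stream the compressor makes choices about " * 150

    # Different settings: different bytes, same content. (Level 0 is
    # "stored", levels 1 and 9 trade effort for density.)
    streams = [zlib.compress(data, level) for level in (0, 1, 9)]
    assert len(set(streams)) > 1
    assert all(zlib.decompress(s) == data for s in streams)

    # Even with identical compression choices, the gzip header itself
    # carries a timestamp, so the bytes change from run to run.
    a = gzip.compress(data, mtime=0)
    b = gzip.compress(data, mtime=int(time.time()))
    assert a != b
    assert gzip.decompress(a) == gzip.decompress(b) == data

I.e. the decompressed payload is the contract; the compressed bytes
are an implementation detail, and the latter is exactly what people
are asking us to freeze.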