On February 2, 2023 6:02 PM, brian m. carlson wrote:
>On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
>> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
>>
>> > I don't think a blurb is necessary, but you're basically
>> > underscoring the problem, which is that nobody is willing to promise
>> > that compression is consistent, but yet people want to rely on that
>> > fact. I'm willing to write and implement a consistent tar spec and
>> > to guarantee compatibility with that, but the tension here is that
>> > people also want gzip to never change its byte format ever, which
>> > frankly seems unrealistic without explicit guarantees. Maybe the
>> > authors will agree to promise that, but it seems unlikely.
>>
>> Just to step back a bit, where does the distinction between
>> guaranteeing the tar format stability and gzip compressed bitstream
>> stability come from? At both levels, the same thing can be expressed
>> in multiple different ways, I think, but spelling out how exactly the
>> compressor compresses is more involved than spelling out how entries
>> in a tar archive is ordered and each entry is expressed, or something?
>
>Yes, at least with my understanding about how gzip and compression in general
>work.
>
>The tar format (and the pax format which builds on it) can mostly be restricted by
>explaining what data is to be included in the pax and tar headers and how it is to be
>formatted. If we say, we will always write such and such information in the pax
>header and sort the keys, and we write such and such information in the tar header,
>then the format is completely deterministic, and we can make nice guarantees.
>
>My understanding about how Lempel-Ziv-based compression algorithms work is that
>there's a lot more freedom to decide how best to compress things and that there
>isn't always a logical obvious choice, but I will admit my understanding is relatively
>limited.
>If someone thinks we can effectively succeed in supporting compression
>more than just relying on gzip, I would be delighted to be shown to be wrong.

The nice part about gzip is that it is generally available on virtually
all platforms (or can be easily obtained). Other compression formats,
like bz2, which sometimes produces denser compression, are not
necessarily available. Availability is something I would be worried
about (clone and checkout failures). Tar formats also need to be used
carefully. Not all platform implementations of tar support all variants.
"ustar" is fairly common, but there are others that are not.
Interoperability needs to be the biggest factor in this decision, IMHO,
rather than compression rates. The alternative is having git supply its
own implementation, but that is a longer-term migration problem,
resembling the SHA-256 migration.

>
>> > That would probably break things, because gzip is GPLv3, and we'd
>> > need to ship a much older GPLv2 gzip, which would probably differ
>> > from the current behaviour, and might also have some security problems.
>>
>> Yup, security issues may make bit-for-bit-stability unrealistic.
>> IIRC, the last time we had discussion on this topic, we settled on
>> stability across the same version of Git (i.e. deterministic result)?

In the old days, it was export concerns. Fortunately, git never really
hit those in a post-2007 timeframe. I would not bank on this issue
staying off the table.

--Randall