On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this would
> > not result in repeatable "recompressed" results.  Thus the checked-out
> > files might be different every time you checked them out. :(
>
> How or why?
>
> Sincere question.

A lossless compression algorithm has to produce an encoded value that,
when decoded, reproduces the original input exactly.  Ideally, it will
also reduce the size of that input.  Beyond that, there's a great deal
of freedom in how to implement it.

Just taking Deflate, which is used in zlib and gzip, as an example:
there are compression settings, such as the size of the window to use,
that affect compression speed, quality of compression (resulting size),
and memory usage.  One might prefer gzip -1 to get better performance
or use less memory, or gzip -9 to reduce the file size as much as
possible.
Even when the same settings are used, the technique can vary between
versions of the software.  For example, GitHub effectively uses git
archive to generate archives, and one time when they upgraded their
servers, the compression in the tarballs and zip files changed, and
everybody who was relying on the archives being bit-for-bit
identical[0] had a problem.
So it would be nearly impossible to produce bit-for-bit repeatable
results without specifying a single, hard-coded implementation, and
even in that case, the behavior might need to change for security
reasons, so it would end up being difficult to achieve.

[0] Neither Git nor GitHub provides this guarantee, so please do not
make this mistake.  If you need a fixed bit-for-bit tarball, save it
as a release artifact.

--
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA