On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this would
> > not result in repeatable "recompressed" results.  Thus the checked-out
> > files might be different every time you checked them out. :(
>
> How or why?
>
> Sincere question.

A lossless compression algorithm has to produce an encoded value that,
when decoded, reproduces the original input exactly.  Ideally, it will
also reduce the size of that input.  Beyond that, there's a great deal
of freedom in how to implement it.

Just taking Deflate, which is used in zlib and gzip, as an example:
there are compression settings, such as the size of the window to use,
that affect compression speed, quality of compression (resulting size),
and memory usage.  One might prefer gzip -1 to get better performance
or use less memory, or gzip -9 to reduce the file size as much as
possible.
Even when the same settings are used, the technique can vary between
versions of the software.  For example, GitHub effectively uses git
archive to generate archives, and one time when they upgraded their
servers, the compression in the tarballs and zip files changed, and
everybody who was relying on the archives being bit-for-bit
identical[0] had a problem.
So it would be nearly impossible to produce bit-for-bit repeatable
results without specifying a single, hard-coded implementation, and
even in that case, the behavior might need to change for security
reasons, so it would end up being difficult to achieve.

[0] Neither Git nor GitHub provides this guarantee, so please do not
make this mistake.  If you need a fixed bit-for-bit tarball, save it
as a release artifact.

--
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA