On 2021-12-17 17:15, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this
> > would not result in repeatable "recompressed" results. Thus the
> > checked-out files might be different every time you checked them
> > out. :(
>
> How or why?
Here are some reasons I can think of (I am no expert):
1) Most compression standards define a file format, not an exact
algorithm, so different implementations of similar algorithms can
produce vastly different outputs.
2) The same program will evolve over time, getting improvements, bug
fixes, etc., so each version of the same program could produce
different output even with the same settings. The same program version
on different platforms could also produce different output.
3) Settings. Compression programs have compression levels, and perhaps
memory-utilization parameters... The way the program applies these may
be neither deterministic nor repeatable (see the sketch after this
list).
4) Threading. Some compression algorithms, git repack itself among
them, can use several threads to analyze the input data, and since the
timing between cooperating threads is not deterministic, they can
produce different results from run to run.
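
To make (1) and (3) concrete, here is a small sketch using Python's
standard zlib module (the exact bytes will vary with your zlib build,
which is rather the point): the same data, compressed by the same
algorithm at two settings, yields two different byte streams, even
though both decompress back to identical content.

  import zlib

  data = b"the quick brown fox jumps over the lazy dog\n" * 1000

  # Same algorithm (deflate), same input, different settings:
  fast = zlib.compress(data, 1)
  best = zlib.compress(data, 9)

  print(fast == best)                                    # False: different bytes
  print(zlib.decompress(fast) == zlib.decompress(best))  # True: same content

So if git stored only the decompressed payload plus a tag saying
"deflate", it would have no way of knowing which of those byte streams
(or some third one, from a different zlib version) the original file
actually was.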
Much of this has to do with the idea that there is usually no such
thing as "done" when it comes to compression. You could search almost
indefinitely for more data patterns to compress the data further. Thus
compression programs have to have limits based on heuristics (how far
to look ahead/behind, how many patterns to remember...) programmed
into them so that they terminate at some point. How these limits are
determined can sometimes be non-deterministic; it may even depend on
system resources (how much RAM the machine has, how long it has
run...) or system config.
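
Put differently, for your scheme to be safe, git would have to verify
up front that decompress-then-recompress reproduces each file
bit-for-bit, and fall back to storing the file as-is whenever it does
not. A hypothetical sketch (the function name is made up, and it
assumes a zlib-wrapped stream):

  import zlib

  def recompress_is_lossless(original: bytes, level: int) -> bool:
      """True only if decompressing and recompressing at the given
      level reproduces the original stream bit-for-bit."""
      try:
          payload = zlib.decompress(original)
      except zlib.error:
          return False  # not a zlib stream we can handle
      return zlib.compress(payload, level) == original

And even a True result only holds for the exact zlib that produced it;
per (2) above, a later release could compress the same payload
differently, so the metadata tag would have to pin the precise
implementation and version, not just "deflate, level 9".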
I hope that helps,
-Martin
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, December 16, 2021, at 18:33, Martin Fick
<mfick@xxxxxxxxxxxxxx> wrote:
On 2021-12-16 14:20, João Victor Bonfim wrote:
> > To expand on this, if what you're storing is already compressed, like
> > Ogg Vorbis files or PNGs, like are found in that repository, then
> > generally they will not delta well. This is also true of things like
> > Microsoft Office or OpenOffice documents, because they're essentially
> > Zip files.
> >
> > The delta algorithm looks for similarities between files to compress
> > them. If a file is already compressed using something like Deflate,
> > used in PNGs and Zip files, then even very similar files will
> > generally look very different, so deltification will generally be
> > ineffective.
...
> Maybe I am thinking too outside the box, but wouldn't it be quite more
> effective for git to identify compressed files, specially on edge cases
> where the compression doesn't have a good chemistry with delta
> compression, decompress them for repo storage while also storing the
> compression algorithm as some metadata tag (like a text string or an
> ID code decided beforehand), and, when creating the work mirrors,
> return the compression to its default state before checkout?
I suspect that for most algorithms and their implementations, this
would not result in repeatable "recompressed" results. Thus the
checked-out files might be different every time you checked them
out. :(
-Martin
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation