On 2021-12-17 17:15, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this
> > would not result in repeatable "recompressed" results. Thus the
> > checked-out files might be different every time you checked them
> > out. :(
>
> How or why?
Here are some reasons I can think of (I am no expert):
1) Most compression standards define a file format, not an exact
algorithm, so different implementations of similar algorithms can
produce vastly different outputs.
2) The same program will evolve over time, getting improvements, bug
fixes, etc., so each version of the same program could produce
different output even with the same settings. The same program version
on different platforms could also produce different output.
3) Settings. Compression programs have compression levels, and perhaps
memory-utilization parameters... The way the program applies these may
be neither deterministic nor repeatable (see the sketch after this
list).
4) Threading. Some compression algorithms, git repack itself among
them, can use several threads to analyze the input data, and since the
timing between cooperating threads is not deterministic, they can
produce different results from run to run.
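
To make (1) and (3) concrete, here is a small sketch using Python's
standard zlib module (the exact bytes will vary with your zlib build,
which is rather the point): the same data, compressed by the same
algorithm at two settings, yields two different byte streams, even
though both decompress back to identical content.

  import zlib

  data = b"the quick brown fox jumps over the lazy dog\n" * 1000

  # Same algorithm (deflate), same input, different settings:
  fast = zlib.compress(data, 1)
  best = zlib.compress(data, 9)

  print(fast == best)                                    # False: different bytes
  print(zlib.decompress(fast) == zlib.decompress(best))  # True: same content

So if git stored only the decompressed payload plus a tag saying
"deflate", it would have no way of knowing which of those byte streams
(or some third one, from a different zlib version) the original file
actually was.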
Much of this has to do with the idea that there is usually no such
thing as "done" when it comes to compression. You could search almost
indefinitely for more data patterns to compress the data further. Thus
compression programs have to have limits based on heuristics (how far
to look ahead/behind, how many patterns to remember...) programmed
into them so that they terminate at some point. How these limits are
determined can sometimes be non-deterministic; it may even depend on
system resources (how much RAM the machine has, how long it has
run...) or system config.
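
Put differently, for your scheme to be safe, git would have to verify
up front that decompress-then-recompress reproduces each file
bit-for-bit, and fall back to storing the file as-is whenever it does
not. A hypothetical sketch (the function name is made up, and it
assumes a zlib-wrapped stream):

  import zlib

  def recompress_is_lossless(original: bytes, level: int) -> bool:
      """True only if decompressing and recompressing at the given
      level reproduces the original stream bit-for-bit."""
      try:
          payload = zlib.decompress(original)
      except zlib.error:
          return False  # not a zlib stream we can handle
      return zlib.compress(payload, level) == original

And even a True result only holds for the exact zlib that produced it;
per (2) above, a later release could compress the same payload
differently, so the metadata tag would have to pin the precise
implementation and version, not just "deflate, level 9".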
I hope that helps,
-Martin
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, December 16, 2021, at 18:33, Martin Fick
<mfick@xxxxxxxxxxxxxx> wrote:
On 2021-12-16 14:20, João Victor Bonfim wrote:
> > To expand on this, if what you're storing is already compressed, like
> > Ogg Vorbis files or PNGs, like are found in that repository, then
> > generally they will not delta well. This is also true of things like
> > Microsoft Office or OpenOffice documents, because they're essentially
> > Zip files.
> >
> > The delta algorithm looks for similarities between files to compress
> > them. If a file is already compressed using something like Deflate,
> > used in PNGs and Zip files, then even very similar files will
> > generally look very different, so deltification will generally be
> > ineffective.
...
> Maybe I am thinking too outside the box, but wouldn't it be quite more
> effective for git to identify compressed files, specially on edge cases
> where the compression doesn't have a good chemistry with delta
> compression, decompress them for repo storage while also storing the
> compression algorithm as some metadata tag (like a text string or an
> ID code decided beforehand), and, when creating the work mirrors,
> return the compression to its default state before checkout?
I suspect that for most algorithms and their implementations, this
would not result in repeatable "recompressed" results. Thus the
checked-out files might be different every time you checked them
out. :(
-Martin
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation