Re: Fw: Curiosity


 



On 2021-12-17 17:15, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this would
> > not result in repeatable "recompressed" results. Thus the checked-out
> > files might be different every time you checked them out. :(
>
> How or why?


Here are some reasons I can think of (I am no expert):

1) Most compression formats are file formats, not exact algorithms, so different implementations of similar algorithms can produce very different output for the same input.

2) The same program evolves over time (improvements, bug fixes, etc.), so different versions can produce different output even with the same settings. The same version on different platforms could also produce different output.

3) Settings. Compression programs have compression levels, perhaps memory utilization parameters... The way the program applies these may be non-deterministic, and thus non-repeatable.

4) Threading. Some compression programs, git repack among them, can use several threads to analyze the input data. Since the timing between cooperating threads is not deterministic, they can produce different results from run to run.
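
To make the settings point concrete, here is a minimal sketch using Python's zlib module (my choice for illustration, nothing git-specific). The sample input and the three levels are arbitrary assumptions. It compresses the same bytes at different levels and checks that each stream decompresses to the same data even though the compressed bytes are not the same:

```python
import zlib

# Hypothetical sample input: repetitive text, so the compressor has choices to make.
data = b"The quick brown fox jumps over the lazy dog. " * 200

# Same input, same library, three different compression levels.
streams = {level: zlib.compress(data, level) for level in (1, 6, 9)}

for level, stream in streams.items():
    # Every stream decompresses to the identical original bytes...
    assert zlib.decompress(stream) == data
    # Show the level, the compressed size, and the two-byte zlib header.
    print(level, len(stream), stream[:2].hex())

# ...but the compressed byte streams themselves are not all the same.
distinct = len(set(streams.values()))
print("distinct compressed streams:", distinct)
```

So knowing only "this was Deflate" is not enough to reproduce the original bytes; you would also need the exact settings, and per points 1 and 2, the exact implementation and version.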

Much of this has to do with the idea that there is usually no such thing as "done" when it comes to compression. You can probably search indefinitely for more data patterns to compress the data further. Thus compression programs need limits based on heuristics (how far to look ahead/behind, how many patterns to remember...) programmed into them so that they come to an end somehow. How these limits are determined can sometimes be non-deterministic; it may even depend on system resources (how much RAM the machine has, how long it has run...) or system config.
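
To tie this back to the original idea (decompress for storage, recompress at checkout), here is a rough sketch of the round-trip check that would have to pass for every file. It again assumes Python's zlib as a stand-in for whatever compressor produced the file; the function name survives_round_trip and the sample payload are hypothetical:

```python
import hashlib
import zlib

def survives_round_trip(compressed: bytes, level: int) -> bool:
    """Decompress, recompress at `level`, and compare hashes with the original bytes."""
    recompressed = zlib.compress(zlib.decompress(compressed), level)
    return hashlib.sha1(recompressed).digest() == hashlib.sha1(compressed).digest()

# Hypothetical file content, "originally" compressed at level 9.
payload = b"some file content that repeats, repeats, repeats. " * 100
original = zlib.compress(payload, 9)

# Same library, same process, same level: the bytes come back unchanged.
same_settings = survives_round_trip(original, 9)

# Same library, different level: the content is intact but the bytes differ.
other_settings = survives_round_trip(original, 1)

print(same_settings, other_settings)
```

The check only passes here because the library, its version, and the level are all identical to the ones used originally. A checkout on another machine could not generally count on that.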

I hope that helps,

-Martin


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, December 16, 2021 at 18:33, Martin Fick
<mfick@xxxxxxxxxxxxxx> wrote:

On 2021-12-16 14:20, João Victor Bonfim wrote:

> > To expand on this, if what you're storing is already compressed, like
> > Ogg Vorbis files or PNGs, like are found in that repository, then
> > generally they will not delta well. This is also true of things like
> > Microsoft Office or OpenOffice documents, because they're essentially
> > Zip files.
> >
> > The delta algorithm looks for similarities between files to compress
> > them. If a file is already compressed using something like Deflate,
> > used in PNGs and Zip files, then even very similar files will generally
> > look very different, so deltification will generally be ineffective.

...

> Maybe I am thinking too outside the box, but wouldn't it be quite more
> effective for git to identify compressed files, specially on edge cases
> where the compression doesn't have a good chemistry with delta compression,
> decompress them for repo storage while also storing the compression
> algorithm as some metadata tag (like a text string or an ID code decided
> beforehand), and, when creating the work mirrors, return the compression
> to its default state before checkout?

I suspect that for most algorithms and their implementations, this would
not result in repeatable "recompressed" results. Thus the checked-out
files might be different every time you checked them out. :(

-Martin

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The Qualcomm Innovation Center, Inc. is a member of Code

Aurora Forum, hosted by The Linux Foundation
