Re: Delta compression not so effective

On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen <mstormo@xxxxxxxxx> wrote:
>
> I reran the repack with the options above (dropping the zlib=9, as you
> suggested)
>
>     $ time git -c pack.threads=4 repack -a -d -F \
>                --window=350 --depth=250 --window-memory=30g
>
> and ended up with
>     $ du -sh .
>     205G        .
>
> In other words, going from 6G to 30G window didn't help a lick on finding
> deltas for those binaries.

Ok.

> I did
>     fprintf(stderr, "%s %u %lu\n",
>             sha1_to_hex(delta_list[i]->idx.sha1),
>             delta_list[i]->hash,
>             delta_list[i]->size);
>
> I assume that's correct?

Looks good.

> I've removed all commit messages, and "sanitized" some filepaths etc, so
> name hashes won't match what's reported, but that should be fine. (the
> object_entry->hash seems to be just a trivial uint32 hash for sorting
> anyways)

Yes. I see your name list and your pack-file index.

> BUT, if I look at the last 3 entries of the sorted git verify-pack output,
> and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I
> get:
...
> while I cannot find ANY of them in the delta_list output?? \

Yes. You have a lot of object names in that log file you sent in
private that aren't in the delta list.

Now, we never even try to delta objects smaller than 50 bytes. I
can't see the object sizes when they don't show up in the delta list,
but looking at some of those filenames, I'd expect them not to fall
into that category.

I guess you could do the printout a bit earlier (on the
"to_pack.objects[]" array - to_pack.nr_objects is the count there).
That should show all of them. But the small objects shouldn't matter.
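Concretely, something like this dropped into builtin/pack-objects.c after the object list has been collected but before delta searching (a fragment against a 2017-era tree, field names from memory, so it won't compile standalone and may need adjusting):

```
	/*
	 * Hedged fragment for builtin/pack-objects.c: dump every object
	 * we are about to pack, before any delta selection, so nothing
	 * gets filtered out of the report.
	 */
	uint32_t i;
	for (i = 0; i < to_pack.nr_objects; i++) {
		struct object_entry *entry = to_pack.objects + i;
		fprintf(stderr, "%s %u %lu\n",
			sha1_to_hex(entry->idx.sha1),
			entry->hash,
			entry->size);
	}
```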

But if you have a file like

   extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib

I would have assumed that it has a size that is > 50. Unless those
"extern" things are placeholders?

> You might get an idea for how to easily create a repo which reproduces the
> issue, and which would highlight it more easily for the ML.

Looking at your sorted object list ready for packing, it doesn't look
horrible. When sorting for size, it still shows a lot of those large
files with the same name hash, so they sorted together in that form
too.

I do wonder if your dll data is simply horrible for xdelta. We've
also limited the delta finding a bit, simply because it had some
O(m*n) behavior that gets very expensive on some patterns. Maybe your
blobs trigger some of those cases.

The diff-delta work all goes back to 2005 and 2006, so it's a long time ago.

What I'd ask you to do is see whether you can make a repository of
just one of the bigger DLLs with its history, particularly if you can
find one that you don't think is _that_ sensitive.

Looking at it, for example, I see that you have that file

   extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so

that seems to have changed several times, and is a largish blob. Could
you try creating a repository with git fast-import that *only*
contains that file (or pick another one), and see if that deltas
well?
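Something along these lines should work (flags from memory of git 2.x, so double-check; note that fast-export with a path limit simplifies history, which is fine for a repro case). Shown here against a throwaway toy repo; for the real test, point fast-export at your repo and that .so path instead:

```shell
# Hedged sketch: replay a single path's history into a fresh repo
# so you can see whether just that file deltas well.
set -e
src=$(mktemp -d); dst=$(mktemp -d)

# Toy source repo: one "big" file revised once, plus an extra file
# that the path limit should exclude.
git -C "$src" init -q
seq 1 50000 > "$src/big.bin"
echo hi > "$src/other.txt"
git -C "$src" add .
git -C "$src" -c user.name=t -c user.email=t@example.com \
    commit -qm v1
seq 1 50010 > "$src/big.bin"
git -C "$src" -c user.name=t -c user.email=t@example.com \
    commit -qam v2

# Replay only big.bin's history into the new repo.
git -C "$dst" init -q
git -C "$src" fast-export HEAD -- big.bin |
    git -C "$dst" fast-import --quiet

# Now see how well that lone file deltas.
git -C "$dst" repack -a -d -f --window=250 --depth=250
git -C "$dst" count-objects -v
```

If the pack of just that one file's revisions is still near the sum of the blob sizes, you have a self-contained demonstration that xdelta does badly on it.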

And if you find some case that doesn't xdelta well, and that you feel
you could make available outside, we could have a test-case...

                 Linus


