On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen <mstormo@xxxxxxxxx> wrote:
>
> No, the list of git verify-objects in the previous post was from the bottom
> of the sorted list, so those are the largest blobs, ~249MB..

.. so with a 6GB window, you should easily still have 20+ objects. Not a
huge window, but it should find some deltas.

But a smaller window - _together_ with a suboptimal sorting choice - could
then result in a lack of successful delta matches.

> So, this repo must be knocking several parts of Git's insides. I was curious
> about why it was so slow on the writing objects part, since the whole repo
> is on a 4x RAID 5, 7k spindles. Now, they are not SSDs sure, but the thing
> has ~400MB/s continuous throughput available.
>
> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU single
> thread (since the "write objects" stage is single threaded, obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just *writing*
much more data, you're compressing all that data with zlib too.

So even with a fast disk subsystem, you won't even be able to saturate the
disk, simply because the compression will be slower (and single-threaded).

> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
> blobs are the same DLLs (multiple of them)

I think the first thing you should test is to repack with fewer threads and
a bigger pack window. Do something like

    -c pack.threads=4 --window-memory=30g

instead, just to see if that starts finding deltas.

> Right, now on this machine, I really didn't notice much difference between
> standard zlib level and doing -9. The 203GB version was actually with
> zlib=9.

Don't. zlib has *horrible* scaling at higher compression levels. It doesn't
actually improve the end result very much, and it makes things *much*
slower.

zlib was a reasonable choice when git started - well-known, stable, easy to
use. But realistically it's a relatively horrible choice today, just
because there are better alternatives now.

>> How sensitive is your material? Could you make a smaller repo with
>> some of the blobs that still show the symptoms? I don't think I want
>> to download 206GB of data even if my internet access is good.
>
> Pretty sensitive, and not sure how I can reproduce this reasonably well.
> However, I can easily recompile git with any recommended
> instrumentation/printfs, if you have any suggestions of good places to
> start? If anyone has good file/line numbers, I'll give that a go, and
> report back?

So the first thing you might want to do is to just print out the objects
after sorting them, and before it starts trying to find deltas.

See prepare_pack() in builtin/pack-objects.c, where it does something like
this:

        if (nr_deltas && n > 1) {
                unsigned nr_done = 0;
                if (progress)
                        progress_state = start_progress(_("Compressing objects"), nr_deltas);
                QSORT(delta_list, n, type_size_sort);
                ll_find_deltas(delta_list, n, window+1, depth, &nr_done);
                stop_progress(&progress_state);

and notice that QSORT() line: that's what sorts the objects. You can add
something like

        for (i = 0; i < n; i++)
                show_object_entry_details(delta_list[i]);

right after that QSORT(), and make it print out the object hash, filename
hash, and size (we don't have the filename that the object was associated
with any more at that stage - they take too much space).
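That show_object_entry_details() helper doesn't exist, you'd write it
yourself. A minimal sketch - assuming the object_entry field names of that
era (idx.sha1, hash, size), which may well differ in your tree - would be
something like

        /* Hypothetical debug helper: dump object hash, filename hash and
         * size for each sorted delta candidate. Field names follow
         * "struct object_entry" in builtin/pack-objects.c around git 2.12;
         * adjust if your version differs. */
        static void show_object_entry_details(struct object_entry *e)
        {
                fprintf(stderr, "%s %08x %lu\n",
                        sha1_to_hex(e->idx.sha1),       /* object hash */
                        (unsigned)e->hash,              /* filename hash */
                        (unsigned long)e->size);        /* object size */
        }

Printing to stderr keeps it out of the pack data in case pack-objects is
writing the pack to stdout, so you can just redirect stderr to a file to
capture the sorted list.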
Save off that array for off-line processing: when you have the object hash,
you can see what the contents are, and match it up with the file in the git
history using something like

    git log --oneline --raw -R --abbrev=40

which shows you the log, but also the "diff" in the form of "this filename
changed from SHA1 to SHA1", so you can match up the object hashes with
where they are in the tree (and where they are in history).

So then you could try to figure out if that type_size_sort() heuristic is
just particularly horrible for you.

In fact, if your data is not *so* sensitive, and you're ok with making the
one-line commit logs and the filenames public, you could make just those
things available, and maybe I'll have time to look at it. I'm in the middle
of the kernel merge window, but I'm in the last stretch, and because of the
SHA1 thing I've been looking at git lately. No promises, though.

              Linus