Re: Delta compression not so effective

On 3/1/2017 12:30, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@xxxxxxxxx> wrote:
>>
>> Indeed, I did do a
>>     -c pack.threads=20 --window-memory=6g
>> to 'git repack', since the machine is a 20-core (40 threads) machine
>> with 126GB of RAM.
>>
>> So I guess with objects of this size, even at 6GB per thread, that's
>> not enough to get a big enough window for proper delta-packing?
>
> Hmm. The 6GB window should be plenty good enough, unless your blobs
> are in the gigabyte range too.

No, the git verify-pack listing in the previous post was from the bottom of the sorted list, so those are the largest blobs, ~249MB.
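
For reference, a sorted list like that can be reproduced with something
along these lines (sorting verify-pack's object-size column; the
single-pack glob is an assumption):

    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail -20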


>> This repo took >14hr to repack on 20 threads, though (the
>> "compressing" step was very fast, but it sat 95% of the time in
>> "writing objects"), so I can only imagine how long pack.threads=1
>> will take :)
>
> Actually, it's usually the compression phase that should be slow - but
> if something is limiting finding deltas (so that we abort early), then
> that would certainly tend to speed up compression.
>
> The "writing objects" phase should be mainly about the actual IO.
> Which should be much faster *if* you actually find deltas.

So this repo must be hitting several of Git's internal heuristics at
once. I was curious why the "writing objects" part was so slow, since
the whole repo sits on a 4-disk RAID 5 of 7,200 RPM spindles. They're
not SSDs, sure, but the array sustains ~400MB/s of sequential
throughput.

iostat -m 5 showed only a trickle of read/write for the process, with a
single thread pegged at 80-100% CPU (the "writing objects" stage being
single-threaded, obviously).

So the failing delta detection must be triggering other negative
behavior.


> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas. But it may be that for
> your case, the binary blobs that you have don't tend to actually
> change in the history, so that heuristic doesn't end up doing
> anything.

These are generally just DLLs (debug & release) whose content gets updated by upstream project updates. So filenames/paths tend to stay identical, while the content changes throughout history.


> The sorting does use the size and the type too, but the "filename
> hash" (which isn't really a hash, it's something nasty to give
> reasonable results for the case where files get renamed) is the main
> sort key.
>
> So you might well want to look at the sorting code too. If filenames
> (particularly the end of filenames) for the blobs aren't good hints
> for the sorting code, that sort might end up spreading all the blobs
> out rather than sort them by size.

Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed blobs are multiple versions of those same DLLs.
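
For reference, the heuristic Linus describes appears to be
pack_name_hash() in pack-objects.h, which boils the tail of the path
down to a 32-bit sort key; quoted from memory, so treat it as a sketch:

    static uint32_t pack_name_hash(const char *name)
    {
            uint32_t c, hash = 0;

            if (!name)
                    return 0;

            /*
             * Effectively a sortable number built from the last
             * sixteen non-whitespace characters; late characters
             * count "most", so names ending in ".dll" sort together.
             */
            while ((c = *name++) != 0) {
                    if (isspace(c))
                            continue;
                    hash = (hash >> 2) + (c << 24);
            }
            return hash;
    }

Since identical paths produce identical keys, stable DLL names should
cluster the versions together rather than spread them out.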


> And again, if that happens, the "can I delta these two objects" code
> will notice that the size of the objects are wildly different and
> won't even bother trying. Which speeds up the "compressing" phase, of
> course, but then because you don't get any good deltas, the "writing
> out" phase sucks donkey balls because it does zlib compression on big
> objects and writes them out to disk.

Right. On this machine I really didn't notice much difference between the standard zlib level and -9; the 203GB version was actually packed with zlib=9.
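
For the record, switching the level for a test like that would be
something along these lines; the -F (--no-reuse-object) is my
assumption, since without it previously compressed object data is
reused as-is:

    git -c pack.compression=9 repack -a -d -F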


> So there are certainly multiple possible reasons for the deltification
> to not work well for you.
>
> How sensitive is your material? Could you make a smaller repo with
> some of the blobs that still show the symptoms? I don't think I want
> to download 206GB of data even if my internet access is good.

Pretty sensitive, and I'm not sure how to reproduce this reasonably well in a smaller repo. However, I can easily recompile Git with any recommended instrumentation/printfs, if you have suggestions for good places to start. If anyone has good file/line numbers, I'll give it a go and report back.
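
For instance, as a rough sketch, something like this next to each early
"return 0" in try_delta() in builtin/pack-objects.c; the variable names
are from memory, so treat them as assumptions:

    /* log why a candidate pair was rejected before deltifying */
    fprintf(stderr, "try_delta reject: trg_size=%lu src_size=%lu max_size=%lu\n",
            (unsigned long)trg_size, (unsigned long)src_size,
            (unsigned long)max_size);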

Thanks!

--
.marius


