Re: Delta compression not so effective


 



On 3/1/2017 18:43, Linus Torvalds wrote:
So, this repo must be knocking several parts of Git's insides. I was curious
about why it was so slow on the writing objects part, since the whole repo
is on a 4x RAID 5, 7k spindles. Now, they are not SSDs sure, but the thing
has ~400MB/s continuous throughput available.

iostat -m 5 showed trickle read/write to the process, and 80-100% CPU single
thread (since the "write objects" stage is single threaded, obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just
*writing* much more data, you're compressing all that data with zlib
too.

So even with a fast disk subsystem, you won't even be able to saturate
the disk, simply because the compression will be slower (and
single-threaded).

I did a simple
    $ time zip -r repo.zip repo/
...
    total bytes=219353596620, compressed=214310715074 -> 2% savings

    real    154m6.323s
    user    133m5.209s
    sys     5m5.338s

also using a single thread and the same disk as git repack. But if you compare it to the numbers below, it's 2.6hrs with zip vs 14.2hrs for the repack (about 1:5.5). So it can't just be the overhead of having to compress the full blobs for lack of deltas.
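
For reference, the arithmetic behind that comparison can be checked with a short Python sketch. The helper name `real_to_seconds` is made up for illustration; the inputs are the `real` times and byte counts quoted in this mail.

```python
import re


def real_to_seconds(stamp: str) -> float:
    """Convert a `time` stamp such as '154m6.323s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


zip_secs = real_to_seconds("154m6.323s")     # 'real' time of the zip run
repack_secs = real_to_seconds("850m3.473s")  # 'real' time of the git repack run

zip_hours = zip_secs / 3600
repack_hours = repack_secs / 3600
slowdown = repack_secs / zip_secs

# Savings reported by zip: 1 - compressed bytes / total bytes
zip_savings = 1 - 214310715074 / 219353596620

print(f"zip {zip_hours:.1f}h, repack {repack_hours:.1f}h, ratio 1:{slowdown:.1f}")
print(f"zip savings {zip_savings:.1%}")
```

This only compares wall-clock time; the repack's `user` time is higher still because it ran with 4 threads during delta search.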


Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
blobs are the same DLLs (multiple of them)

I think the first thing you should test is to repack with fewer
threads, and a bigger pack window. Do something like

  -c pack.threads=4 --window-memory=30g

instead. Just to see if that starts finding deltas.

I reran the repack with the options above (dropping the zlib=9, as you suggested)

    $ time git -c pack.threads=4 repack -a -d -F \
               --window=350 --depth=250 --window-memory=30g

    Delta compression using up to 4 threads.
    Compressing objects:   100% (609413/609413)
    Writing objects: 100% (666515/666515), done.
    Total 666515 (delta 499585), reused 0 (delta 0)

    real	850m3.473s
    user	897m36.280s
    sys 	10m8.824s

and ended up with
    $ du -sh .
    205G	.

In other words, going from 6G to 30G window didn't help a lick on finding deltas for those binaries. (205G was what I had with the non-aggressive 'git gc', before zlib=9 repack.)

BUT, oddly enough, even if the new size is almost identical to the previous version without zlib=9,

    $ git verify-pack --verbose objects/pack/pack-29b06ae4d458ac03efd98b330702d30e851b2933.idx | sort -k3n | tail -n15

gives me a VERY different list than before:

    17e5b2146311256dc8317d6e0ed1291363c31a76 blob 673399562 110248747 190398904084
    04c881d9069eab3bd0d50dd48a047a60f79cc415 blob 673863358 111710559 188818868865
    fdcabd75aeda86ce234d6e43b54d27d993acddcd blob 674523614 111956017 185706433825
    d8815033d1b00b151ae762be8a69ffa35f55c4b4 blob 675286758 112099638 185153570292
    997e0b9d3bcf440af10c7bbe535a597ca46c492c blob 678274978 112654668 184041692883
    dfed141679e5c33caaa921cbe1595a24967a3c2c blob 681692132 113121410 186753502634
    76a4000e71cd5b85f2265e02eb876acf1f33cc55 blob 682673430 112743915 184563542298
    81e7292c4d2da2d2d236fbfaa572b6c4e8d787f4 blob 684543130 112797325 181805773038
    991184c60e1fc6b2721bf40f181012b72b10d02d blob 684543130 112796892 182344388066
    0e9269f4abd1440addd05d4f964c96d74d11cd89 blob 684547270 112809074 181070719237
    6019b6d09759cf5adeac678c8b56d177803a0486 blob 684547270 112809336 180517242193
    70a5f70bd205329472d6f9c660eb3f7d207a596e blob 686852038 112873611 183520467528
    e86a0064d9652be9f5e3a877b11a665f64198ecd blob 686852038 112874133 182893219377
    bae8de0555be5b1ffa0988cbc6cba698f6745c26 blob 894041802 137223252 2355250324
    94dc773600e03ac1e6f3ab077b70b8297325ad77 blob 945197364 145219485 16560137220

compared to the last 3 entries of the previous pack:

    e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
    f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
    8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699
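
To make the difference between the two lists easier to see, here is a small Python sketch that computes how much of its original size each blob occupies in the pack. It assumes the usual `git verify-pack -v` column order (SHA-1, type, size, size in pack, offset); `PackEntry` and `parse_entry` are names invented here, and only a few of the lines quoted above are included as sample data.

```python
from typing import NamedTuple


class PackEntry(NamedTuple):
    sha1: str
    kind: str
    size: int         # uncompressed object size in bytes
    packed_size: int  # bytes the object occupies inside the pack
    offset: int       # position of the object in the pack file

    @property
    def ratio(self) -> float:
        """Packed size as a fraction of the uncompressed size."""
        return self.packed_size / self.size


def parse_entry(line: str) -> PackEntry:
    # Delta entries carry extra columns (depth, base); only the first five are used.
    sha1, kind, size, packed, offset = line.split()[:5]
    return PackEntry(sha1, kind, int(size), int(packed), int(offset))


new_pack = [
    "17e5b2146311256dc8317d6e0ed1291363c31a76 blob 673399562 110248747 190398904084",
    "94dc773600e03ac1e6f3ab077b70b8297325ad77 blob 945197364 145219485 16560137220",
]
old_pack = [
    "e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538",
    "8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699",
]

new_entries = [parse_entry(line) for line in new_pack]
old_entries = [parse_entry(line) for line in old_pack]

for label, entries in (("new", new_entries), ("old", old_entries)):
    for entry in entries:
        print(f"{label} {entry.sha1[:8]} {entry.size:>10} -> "
              f"{entry.packed_size:>10} ({entry.ratio:.0%})")
```

On these samples the largest blobs in the new pack take roughly 15-16% of their original size, while the largest entry of the previous pack took about 95%.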


So the first thing you might want to do is to just print out the
objects after sorting them, and before it starts trying to find
deltas.
...
and notice that QSORT() line: that's what sorts the objects. You can
do something like

                for (i = 0; i < n; i++)
                        show_object_entry_details(delta_list[i]);

I did
    fprintf(stderr, "%s %u %lu\n",
            sha1_to_hex(delta_list[i]->idx.sha1),
            delta_list[i]->hash,
            delta_list[i]->size);

I assume that's correct?


In fact, if your data is not *so* sensitive, and you're ok with making
the one-line commit logs and the filenames public, you could make just
those things available, and maybe I'll have time to look at it.

I've removed all commit messages, and "sanitized" some filepaths etc, so name hashes won't match what's reported, but that should be fine. (the object_entry->hash seems to be just a trivial uint32 hash for sorting anyways)
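
For readers unfamiliar with that field, here is a minimal Python sketch of a name hash of this kind. It is modeled on what Git's `pack_name_hash()` is understood to do (skip whitespace, shift the running value right by two bits, add the new character in the top byte); treat the exact constants as an assumption rather than a quote from the source. The function name `name_hash` is made up here. The point is only that the last characters of a path weigh the most, so paths with the same ending sort near each other.

```python
C_WHITESPACE = b" \t\n\v\f\r"  # the characters C's isspace() accepts in the "C" locale


def name_hash(path: str) -> int:
    """Sortable 32-bit number dominated by the trailing characters of `path`."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in C_WHITESPACE:
            continue
        # Older characters fade out by two bits per step; the newest one
        # lands in the top byte. Mask to emulate uint32 wrap-around.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


paths = [
    "extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib",
    "extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib",
    "extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib",
]
for path in sorted(paths, key=name_hash):
    print(f"{name_hash(path):10d} {path}")
```

Because the sanitized paths differ from the real ones, hashes computed this way will differ from those in the dump, which is the mismatch mentioned above.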

I really don't want the files on the mailing list, so I'll send you a link directly. However, small snippets for public discussions about potential issues would be fine, obviously.

BUT, if I look at the last 3 entries of the sorted git verify-pack output, and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I get:

    :100644 100644 991184c60e1fc6b2721bf40f181012b72b10d02d e86a0064d9652be9f5e3a877b11a665f64198ecd M extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
    :100644 000000 bae8de0555be5b1ffa0988cbc6cba698f6745c26 0000000000000000000000000000000000000000 D extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib
    :000000 100644 0000000000000000000000000000000000000000 94dc773600e03ac1e6f3ab077b70b8297325ad77 A extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib

while I cannot find ANY of them in the delta_list output?? Shouldn't delta_list contain all objects, sorted by some heuristics? Or is the delta_list already here limited by some other metric, before the QSORT?
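
One way to double-check that observation is to load the dumped lines into a set and test each SHA-1 from the verify-pack tail against it. A minimal Python sketch, assuming the dump has one `sha1 hash size` line per object as produced by the fprintf above; the dump lines below are made-up placeholders, not real output, and `load_dump_shas` and `missing_from_dump` are hypothetical helpers.

```python
def load_dump_shas(dump_text: str) -> set:
    """Collect the SHA-1 column from 'sha1 hash size' lines, skipping anything else."""
    shas = set()
    for line in dump_text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            shas.add(fields[0])
    return shas


def missing_from_dump(wanted, dump_text: str) -> list:
    """Return the SHA-1s in `wanted` that never appear in the dump."""
    present = load_dump_shas(dump_text)
    return [sha for sha in wanted if sha not in present]


# Placeholder dump content (NOT real data), in the fprintf format.
dump = """\
0123456789abcdef0123456789abcdef01234567 1234567890 4096
89abcdef0123456789abcdef0123456789abcdef 2345678901 8192
"""

# The last 3 entries of the sorted verify-pack output, plus one SHA-1
# that is known to be in the placeholder dump.
wanted = [
    "e86a0064d9652be9f5e3a877b11a665f64198ecd",
    "bae8de0555be5b1ffa0988cbc6cba698f6745c26",
    "94dc773600e03ac1e6f3ab077b70b8297325ad77",
    "0123456789abcdef0123456789abcdef01234567",
]

for sha in missing_from_dump(wanted, dump):
    print("not in delta_list dump:", sha)
```

With the real stderr capture in place of `dump`, an empty result would mean the objects are in delta_list after all and were simply overlooked.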

Also note that the 'git log --oneline --raw -R --abbrev=40' only gave me the log for trunk, so the second-to-last object must have been added in a branch and deleted on trunk; that's why I could only see the deletion of that object in the output.


You might get an idea for how to easily create a repo which reproduces the issue, and which would highlight it more easily for the ML.

I was thinking of maybe scripting up
    make install prefix=extern
for each Git release, and rewrite trunk history with extern/ binary commits at the time of each tag; maybe that would show the same behavior? But then again, most of the binaries are just copies of each other, and only ~10M, so probably not a big win.


Thanks!

--
.marius


