On 3/1/2017 18:43, Linus Torvalds wrote:
>> So, this repo must be stressing several parts of Git's insides. I was
>> curious about why it was so slow during the object writing phase, since
>> the whole repo is on a 4x RAID 5, 7k spindles. Now, they are not SSDs,
>> sure, but the thing has ~400MB/s of continuous throughput available.
>> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU on
>> a single thread (since the "write objects" stage is single threaded,
>> obviously).
>
> So the writing phase isn't multi-threaded because it's not expected to
> matter. But if you can't even generate deltas, you aren't just
> *writing* much more data, you're compressing all that data with zlib
> too.
>
> So even with a fast disk subsystem, you won't even be able to saturate
> the disk, simply because the compression will be slower (and
> single-threaded).

I did a simple
$ time zip -r repo.zip repo/
...
total bytes=219353596620, compressed=214310715074 -> 2% savings
real 154m6.323s
user 133m5.209s
sys 5m5.338s
also using a single thread and the same disks as git repack. But if you
compare it to the numbers below, it's 2.6hrs with zip vs 14.2hrs for the
repack (1:5.5). (zip pushed those ~219GB through zlib in 154 minutes,
about 24MB/s, so compression rather than I/O is clearly the limiter.)
So it can't just be the overhead of having to compress the full blobs
due to the lack of deltas..
Filenames are fairly static, and the bulk of the 6000 biggest
non-delta'ed blobs are the same DLLs (multiple copies of them).

> I think the first thing you should test is to repack with fewer
> threads, and a bigger pack window. Do something like
>
>    -c pack.threads=4 --window-memory=30g
>
> instead. Just to see if that starts finding deltas.

I reran the repack with the options above (dropping zlib=9, as you
suggested):
$ time git -c pack.threads=4 repack -a -d -F \
--window=350 --depth=250 --window-memory=30g
Delta compression using up to 4 threads.
Compressing objects: 100% (609413/609413)
Writing objects: 100% (666515/666515), done.
Total 666515 (delta 499585), reused 0 (delta 0)
real 850m3.473s
user 897m36.280s
sys 10m8.824s
and ended up with
$ du -sh .
205G .
In other words, going from a 6G to a 30G window didn't help a lick in
finding deltas for those binaries. (205G was what I had with the
non-aggressive 'git gc', before the zlib=9 repack.)

BUT, oddly enough, even though the new size is almost identical to the
previous version without zlib=9,

$ git verify-pack --verbose \
    objects/pack/pack-29b06ae4d458ac03efd98b330702d30e851b2933.idx \
    | sort -k3n | tail -n15

gives me a VERY different list than before (columns are sha1, type,
size, size-in-pack, offset-in-pack):

17e5b2146311256dc8317d6e0ed1291363c31a76 blob 673399562 110248747 190398904084
04c881d9069eab3bd0d50dd48a047a60f79cc415 blob 673863358 111710559 188818868865
fdcabd75aeda86ce234d6e43b54d27d993acddcd blob 674523614 111956017 185706433825
d8815033d1b00b151ae762be8a69ffa35f55c4b4 blob 675286758 112099638 185153570292
997e0b9d3bcf440af10c7bbe535a597ca46c492c blob 678274978 112654668 184041692883
dfed141679e5c33caaa921cbe1595a24967a3c2c blob 681692132 113121410 186753502634
76a4000e71cd5b85f2265e02eb876acf1f33cc55 blob 682673430 112743915 184563542298
81e7292c4d2da2d2d236fbfaa572b6c4e8d787f4 blob 684543130 112797325 181805773038
991184c60e1fc6b2721bf40f181012b72b10d02d blob 684543130 112796892 182344388066
0e9269f4abd1440addd05d4f964c96d74d11cd89 blob 684547270 112809074 181070719237
6019b6d09759cf5adeac678c8b56d177803a0486 blob 684547270 112809336 180517242193
70a5f70bd205329472d6f9c660eb3f7d207a596e blob 686852038 112873611 183520467528
e86a0064d9652be9f5e3a877b11a665f64198ecd blob 686852038 112874133 182893219377
bae8de0555be5b1ffa0988cbc6cba698f6745c26 blob 894041802 137223252 2355250324
94dc773600e03ac1e6f3ab077b70b8297325ad77 blob 945197364 145219485 16560137220
compared to the last 3 entries of the previous pack:

e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699

> So the first thing you might want to do is to just print out the
> objects after sorting them, and before it starts trying to find
> deltas.
> ...
> and notice that QSORT() line: that's what sorts the objects. You can
> do something like
>
>    for (i = 0; i < n; i++)
>            show_object_entry_details(delta_list[i]);

I did

    fprintf(stderr, "%s %u %lu\n",
            sha1_to_hex(delta_list[i]->idx.sha1),
            delta_list[i]->hash,
            delta_list[i]->size);

I assume that's correct?
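
In context, the hack sits right after the sort; roughly like this
(sketched against a 2017-era builtin/pack-objects.c, so take the exact
field names with a grain of salt):

    /* prepare_pack(): dump delta_list right after it has been sorted,
     * so the output reflects the order the delta search will walk */
    QSORT(delta_list, n, type_size_sort);
    for (i = 0; i < n; i++)
            fprintf(stderr, "%s %u %lu\n",
                    sha1_to_hex(delta_list[i]->idx.sha1), /* object id */
                    delta_list[i]->hash,       /* name hash (sort key) */
                    (unsigned long)delta_list[i]->size);  /* object size */
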
> In fact, if your data is not *so* sensitive, and you're ok with making
> the one-line commit logs and the filenames public, you could make just
> those things available, and maybe I'll have time to look at it.

I've removed all commit messages and "sanitized" some file paths etc.,
so the name hashes won't match what's reported, but that should be fine.
(The object_entry->hash seems to be just a trivial uint32 hash for
sorting anyway.)
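
For reference, that hash appears to be pack_name_hash() in
pack-objects.h; quoting roughly from memory, so check your tree:

    static inline uint32_t pack_name_hash(const char *name)
    {
            uint32_t c, hash = 0;

            if (!name)
                    return 0;

            /*
             * This effectively just creates a sortable number from the
             * last sixteen non-whitespace characters. Last characters
             * count "most", so things that end in ".c" sort together.
             */
            while ((c = *name++) != 0) {
                    if (isspace(c))
                            continue;
                    hash = (hash >> 2) + (c << 24);
            }
            return hash;
    }

So only the tail of the path matters, and identically named files hash
(and therefore sort) together no matter where they live in the tree.
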
I really don't want the files on the mailing list, so I'll send you a
link directly. However, small snippets for public discussion about
potential issues would be fine, obviously.

BUT, if I look at the last 3 entries of the sorted git verify-pack
output, and look for them in the 'git log --oneline --raw -R
--abbrev=40' output, I get:

:100644 100644 991184c60e1fc6b2721bf40f181012b72b10d02d e86a0064d9652be9f5e3a877b11a665f64198ecd M	extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
:100644 000000 bae8de0555be5b1ffa0988cbc6cba698f6745c26 0000000000000000000000000000000000000000 D	extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib
:000000 100644 0000000000000000000000000000000000000000 94dc773600e03ac1e6f3ab077b70b8297325ad77 A	extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib

while I cannot find ANY of them in the delta_list output?? Shouldn't
delta_list contain all objects, sorted by some heuristic? Or is
delta_list already limited by some other criterion at this point, before
the QSORT?
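
Peeking at prepare_pack() again, it does look like entries can be
dropped before the sort ever happens. The loop that fills delta_list is
roughly this (again from memory, so the exact conditions may differ):

    for (i = 0; i < to_pack.nr_objects; i++) {
            struct object_entry *entry = to_pack.objects + i;

            if (entry->delta)
                    continue; /* already reusing an on-disk delta */
            if (entry->size < 50)
                    continue; /* too small to be worth a delta */
            if (entry->no_try_delta)
                    continue; /* "-delta" attribute, or blob bigger
                                 than core.bigFileThreshold */
            delta_list[n++] = entry;
    }

Notably, no_try_delta seems to get set for blobs larger than
core.bigFileThreshold (512MB by default), which would cover every one of
those ~670-945MB entries above, and would explain why they never show up
in the dump.
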
Also note that 'git log --oneline --raw -R --abbrev=40' only gave me the
log for trunk, so the second to last object must have been added on a
branch and deleted on trunk; I could only see the deletion of that
object in the output.

> You might get an idea for how to easily create a repo which reproduces
> the issue, and which would highlight it more easily for the ML.

I was thinking of maybe scripting up

    make install prefix=extern

for each Git release, and rewriting trunk history with extern/ binary
commits at the time of each tag; maybe that would show the same
behavior? But then again, most of the binaries are just copies of each
other, and only ~10MB, so probably not a big win.
Thanks!
--
.marius