On Wed, Dec 12 2018, Jeff King wrote: > On Wed, Dec 12, 2018 at 03:01:47AM +0000, Iucha, Florin wrote: > >> I am running “git-repack -A -d -f -F --window=250 --depth=250” on a >> Git repository converted using git-svn. > > Sort of tangential to your question, but: > > - Using "-F" implies "-f" already, so you don't need both. That said, > you are probably wasting CPU to use "-F", unless you have adjusted > zlib compression settings since the last pack. (Whereas using "-f" > is useful, if you're willing to pay the CPU tradeoff). > > - Using --depth=250 does not actually decrease the packfile size very > much, and results in a packfile which is more expensive for > subsequent processes to use. Some experimentation showed that > --depth=50 is a sweet spot, and that is the default for both normal > "git gc" and "git gc --aggressive" these days. > > See 07e7dbf0db (gc: default aggressive depth to 50, 2016-08-11) for > more discussion. I wonder if the size is still too high. I'd rather take the 5-10% speedup quoted in that commit than a slight decrease in disk space use. The git.git repository now with the repack command in that commit message, with --depth=$n: 573M /tmp/git-1 200M /tmp/git-2 118M /tmp/git-3 91M /tmp/git-4 82M /tmp/git-5 77M /tmp/git-6 74M /tmp/git-7 72M /tmp/git-8 71M /tmp/git-9 70M /tmp/git-10 67M /tmp/git-15 66M /tmp/git-20 65M /tmp/git-40 65M /tmp/git-35 65M /tmp/git-30 65M /tmp/git-25 64M /tmp/git-50 64M /tmp/git-45 Produced via: parallel 'rm -rf /tmp/git-{}; cp -Rvp /tmp/git.git/ /tmp/git-{} && git -C /tmp/git-{} repack -adf --depth={} --window=250' ::: {1..10} 15 20 25 30 35 40 45 50 And then the log -S benchmarks: s/iter log -S 1 log -S 50 log -S 45 log -S 35 log -S 30 log -S 25 log -S 20 log -S 15 log -S 10 log -S 5 log -S 1 12.6 -- -26% -27% -32% -36% -38% -38% -39% -40% -42% log -S 50 9.28 36% -- -0% -7% -13% -15% -16% -18% -19% -21% log -S 45 9.25 36% 0% -- -7% -12% -15% -16% -17% -19% -20% log -S 35 8.62 46% 8% 7% -- -6% -9% -10% -11% -13% -14% log -S 30 8.12 55% 14% 14% 6% -- -3% -4% -6% -8% -9% log -S 25 7.86 60% 18% 18% 10% 3% -- -1% -3% -5% -6% log -S 20 7.77 62% 19% 19% 11% 4% 1% -- -2% -3% -5% log -S 15 7.64 65% 21% 21% 13% 6% 3% 2% -- -2% -4% log -S 10 7.51 68% 24% 23% 15% 8% 5% 3% 2% -- -2% log -S 5 7.37 71% 26% 25% 17% 10% 7% 5% 4% 2% -- So we're getting a 20% speedup in log -S by using --depth=5 instead of --depth=50, neatly at the cost of 20% more disk space, but could also get a 19% speedup for 8% (--depth=10) more disk space, etc. Then in rev-list it makes less of a difference, narrowing this down a bit: s/iter rev-list 50 rev-list 25 rev-list 10 rev-list 5 rev-list 50 1.61 -- -1% -4% -5% rev-list 25 1.60 1% -- -3% -4% rev-list 10 1.55 4% 3% -- -1% rev-list 5 1.54 5% 4% 1% -- It seems to me we should at least move this to --depth=25 by default. A ~15% speedup in log -S & other mass-blob lookups for 1.5% more disk space seems like a good trade-off. As your commit notes RAM (or disk space) is not commonly the bottleneck anymore. >> The system is a 16 core / 32 thread Threadripper with 128GB of RAM and >> NVMe storage. The repack starts strong, with 32 threads but it fairly >> quickly gets to 99% done and the number of threads drops to 4 then 3 >> then 2. However, running “dstat 5” I see lots of “sys” time without >> any IO time (the network traffic you see is caused by SSH). > > This sounds mostly normal and expected. The parallel part of a repack is > the delta search, which is not infinitely parallelizable. Each worker > thread is given a "chunk" of objects, and it uses a sliding window to > search for delta pairs through that chunk. You don't want a chunk that > approaches the window size, since at every chunk boundary you're missing > delta possibilities. > > The default chunk size is about 1/nr_threads of the total list size > (i.e., we portion out all the work). And then when a thread finishes, we > take work from the thread with the most work remaining, and portion it > out. However, at some point the chunks approach their minimum, and we > stop dividing. So the number of threads will drop, eventually to 1, and > you'll be waiting on it to finish that final chunk. > > So that's all working as planned. Having high sys load does seem a bit > odd. Most of the effort should be going to reading the mmap'd data from > disk, zlib-inflating it and computing a fingerprint, and then comparing > the fingerprints. So that would mostly be user time. > >> Running a strace on the running git-repack process shows only these: >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- >> >> Any idea on how to debug this? I have ran git-repack under gdb, but it seems to spin on builtin/repack.c line 409. > > The heavy lifting here is done by the pack-objects child process, not > git-repack itself. Try running with GIT_TRACE=1 in the environment to > see the exact invocation, but timing and debugging: > > git pack-objects --all --no-reuse-delta --delta-base-offset --stdout \ > </dev/null >/dev/null > > should produce interesting results. > > The SIGALRM loop you see above is likely just the progress meter > triggering once per second (the actual worker threads are updating an > int, and then at least once per second we'll show the int). > > -Peff