On Sun, Apr 29, 2012 at 09:53:31AM -0400, Nicolas Pitre wrote:

> But my remark was related to the fact that you need to double the
> affected resources to gain marginal improvements at some point. This is
> true about computing hardware too: eventually you need way more gates
> and spend much more $$$ to gain some performance, and the added
> performance is never linear with the spending.

Right, I agree with that. The trick is just finding the right spot on
that curve for each repo to maximize the reward/effort ratio.

> > 1. Should we bump our default window size? The numbers above show
> >    that typical repos would benefit from jumping to 20 or even 40.
>
> I think this might be a good indication that the number of objects is a
> bad metric to size the window, as I mentioned previously.
>
> Given that you have the test repos already, could you re-run it with
> --window=1000 and play with --window-memory instead? I would be curious
> to see if this provides more predictable results.

It doesn't help. The git.git repo does well with about a 1m window
limit. linux-2.6 is somewhere between 1m and 2m. But the phpmyadmin
repo wants more like 16m. So it runs into the same issue as using
object counts.

But it's much, much worse than that. Here are the actual numbers (same
format as before; the left-hand column is either the window size (no
unit) or the window-memory limit (k/m unit), followed by the resulting
pack size, its percentage of the baseline --window=10 pack, the user
CPU time, and finally its percentage of the baseline):

git:

   10  |  31.4M (100%) |   54s (100%)
   20  |  28.8M ( 92%) |   72s (133%)
  128k |  81.4M (260%) |   77s (142%)
  256k |  59.1M (188%) |  106s (195%)
  512k |  44.5M (142%) |  166s (306%)
   1m  |  28.7M ( 91%) |  267s (491%)
   2m  |  27.0M ( 86%) |  347s (637%)
   4m  |  26.0M ( 83%) |  417s (767%)

linux-2.6:

   10  |   564M (100%) |  990s (100%)
   20  |   521M ( 92%) | 1323s (134%)
  128k |  1.41G (256%) | 1322s (133%)
  256k |  1.08G (196%) | 1810s (183%)
  512k |   783M (139%) | 2775s (280%)
   1m  |   579M (103%) | 4620s (466%)
   2m  |   504M ( 89%) | 6786s (685%)
   4m  |   479M ( 85%) | 8119s (819%)

phpmyadmin:

   10  |   380M (100%) | 1617s (100%)
   80  |   163M ( 43%) | 3410s (211%)
  128k |  3.42G (921%) | 2367s (146%)
  256k |  3.36G (904%) | 2437s (151%)
  512k |  3.22G (865%) | 2589s (160%)
   1m  |  3.10G (833%) | 2746s (170%)
   2m  |   436M (115%) | 1674s (104%)
   4m  |   299M ( 78%) | 2140s (132%)
   8m  |   222M ( 58%) | 2751s (170%)
  16m  |   178M ( 47%) | 3334s (206%)

I intentionally started with a too-small memory limit so we could see
the effect as the window size approached something reasonable. You can
see the pack sizes become comparable to --window=20 at around
--window-memory=1m for the git and linux-2.6 cases.

But look at the CPU usage. For a comparable resulting pack size,
limiting the window memory uses 4-5x as much CPU. I'm not sure what is
causing that behavior. My guess is that for small objects we end up
with a really huge window (in terms of number of objects), but it
doesn't end up actually saving us much space, because there is not as
much space to be saved with small objects. So we spend a lot of extra
time looking at objects that don't yield big space savings.

For some of the really tiny limits, the "writing" phase ended up
dominating. For example, linux-2.6 at 128k ends up with a horribly
large pack that takes even longer to produce than --window=10. These
numbers don't reflect the split between the compression and writing
phases, but I noticed while watching the progress meter that the
writing phase was quite slow in such cases, mostly because we end up
having to zlib deflate a lot more data (which I confirmed via perf).
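(For the curious, a rough sketch of the kind of profiling I mean; the
exact invocation below is just an example, assuming a Linux box with
perf installed and a git binary built with symbols:

  # profile one of the pathological data points from above and see
  # where the cycles go; the pack output itself is thrown away
  perf record -g -o pack.perf.data -- \
    git pack-objects --stdout --all-progress-implied --all \
      --no-reuse-delta --window=1000 --window-memory=128k \
      </dev/null >/dev/null
  perf report --stdio -i pack.perf.data --sort=symbol | head -20

which is how the extra zlib deflate time in the writing phase shows up.)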
Interestingly, the phpmyadmin repo does not have the same issue. The
CPU usage for the object and memory limits is about the same (probably
because its history is dominated by similar-sized .po files, so the two
limits end up equating to each other).

> > 2. Is there a heuristic or other metric we can figure out to
> >    differentiate the first two repositories from the third, and use
> >    a larger window size on the latter?
>
> Maybe we could look at the size reduction within the delta search loop.
> If the reduction quickly diminishes as tested objects are further away
> from the target one then the window doesn't have to be very large,
> whereas if the reduction remains more or less constant then it might be
> worth searching further. That could be used to dynamically size the
> window at run time.

I really like the idea of dynamically sizing the window based on what
we find. If it works. I don't think there's any reason you couldn't
have 50 absolutely terrible delta candidates followed by one really
amazing delta candidate. But maybe in practice the window tends to get
progressively worse due to the heuristics, and outliers are unlikely.
I guess we'd have to experiment.

> > 3. Does the phpmyadmin case give us any insight into whether we can
> >    improve our window sorting algorithm?
> [...]
>
> You could test this theory by commenting out the size comparisons in
> type_size_sort() and re-run the test.

I'll try this next.

-Peff

PS Here's my updated collection script, just for reference.

-- >8 --
#!/bin/sh

windows='10 20 128k 256k 512k 1m 2m 4m'

repo=$1; shift
test $# -gt 0 && windows="$*"

cd "$repo" || exit

for i in $windows; do
	case "$i" in
	*[kmg]) opts="--window=1000 --window-memory=$i" ;;
	*) opts="--window=$i" ;;
	esac

	echo >&2 "Repacking $repo with $opts..."
	time -f %U -o time.out \
	  git pack-objects --stdout --all-progress-implied --all \
	    --no-reuse-delta $opts </dev/null | wc -c >size.out

	echo "$i `cat size.out` `cat time.out`"
	rm -f size.out time.out
done
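(In case anyone wants to reproduce the numbers: the script takes the
repository path as its first argument and an optional list of window
settings after that, and prints one line per setting in the form
"<setting> <pack bytes> <user seconds>". Note that "-f %U" wants GNU
time, i.e. /usr/bin/time rather than a shell keyword. The script name
and paths below are of course placeholders:

  ./repack-stats.sh /path/to/git.git
  ./repack-stats.sh /path/to/phpmyadmin 10 80 128k 1m 16m
)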