Re: [PATCH 0/7] textconv caching

Jeff King <peff@xxxxxxxx> · Fri, 2 Apr 2010 02:14:21 -0400

On Thu, Apr 01, 2010 at 08:01:59PM -0400, Jeff King wrote:

>   [before]
>   $ time git show >/dev/null
>   real    0m13.724s
>   user    0m12.057s
>   sys     0m1.624s
> 
>   [after (with cache primed)]
>   $ time git show >/dev/null
>   real    0m0.009s
>   user    0m0.004s
>   sys     0m0.004s

Since this is a space-time tradeoff, I thought it would make sense to
show some size numbers as a followup.

To get a sense of the size of the repo (it's almost all photos and
videos):

  [size of the repo, already fully packed]
  $ du -sh .git/objects
  4.0G    .git/objects

  [the number of unique blobs through all history; most are binary media]
  $ git log --raw --no-abbrev | awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
  10605
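
For anyone unfamiliar with the raw diff format, here is a minimal
sketch of what that pipeline counts. Each ":" line carries the old and
new blob IDs in fields 3 and 4. The sample data and both function names
are made up for illustration. The second function is a variant, not
what was run above: it drops the all-zero placeholder ID that raw
output uses for added or deleted paths, which would otherwise count as
one extra "blob".

```shell
# Hypothetical stand-in for `git log --raw --no-abbrev` output.
# The object IDs are fabricated 40-digit strings, not real hashes.
sample_raw() {
  zero=$(printf '%040d' 0)   # placeholder ID for "no blob on this side"
  a=$(printf '%040d' 1)
  b=$(printf '%040d' 2)
  c=$(printf '%040d' 3)
  printf 'commit %s\n' "$c"
  printf ':000000 100644 %s %s A\ta.jpg\n' "$zero" "$a"
  printf ':100644 100644 %s %s M\ta.jpg\n' "$a" "$b"
  printf ':000000 100644 %s %s A\tb.mov\n' "$zero" "$c"
}

# Same pipeline as in the mail: print both IDs from every raw line,
# then count the distinct ones.
count_unique_ids() {
  awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
}

# Variant (an assumption about what one might want): ignore the
# all-zero placeholder so only real blob IDs are counted.
count_unique_blobs() {
  awk '/^:/ {print $3 "\n" $4}' | grep -v '^0*$' | sort -u | wc -l
}

sample_raw | count_unique_ids
sample_raw | count_unique_blobs
```

Since `sort -u` collapses the placeholder to a single line, the two
counts differ by at most one, which is noise at this scale.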

In comparison, the metadata for a given file (produced by the textconv)
is about 200 bytes of text.

So I did a big cache priming:

  $ time git log -p >/dev/null
  real    39m29.748s
  user    23m1.090s
  sys     3m46.642s

Slow, and unsurprisingly it spends quite a bit of time waiting on I/O
(almost 13 of the 39 minutes are neither user nor sys time). The
result is a notes tree with almost one textconv per blob:

  $ git ls-tree -r notes/textconv/mfo | wc -l
  10317

We're now using almost 200M:

  $ git count-objects
  39513 objects, 198604 kilobytes

But wait. Many of those objects are trees for stale versions of the
cache. Repacking separates what is still reachable from the rest:

  $ git repack -d
  $ (cd .git/objects/pack && du -k *.pack)
  2056    pack-34170e72ec40a07e98aae044479abccc9e02751b.pack
  4089224 pack-81797628f3aebf6a0bdc082fa05ec14932910534.pack
  $ git count-objects
  30685 objects, 163288 kilobytes

Fully packed, the cache is only about 2M (down from 35M of loose
objects; it deltas quite well because there is a lot of overlap in my
metadata). The other 160M is cruft, and we can prune it away:

  $ git prune
  $ git count-objects
  0 objects, 0 kilobytes

And of course, the final speed result:

  $ time git log -p >/dev/null
  real    0m7.606s
  user    0m6.084s
  sys     0m0.788s

So I take away two things from this:

  1. The size tradeoff is definitely worthwhile for some workloads. In
     this case, the textconv version is orders of magnitude smaller than
     the original. I'd be interested to see numbers for something like a
     repository of documents that get textconv'd to pure ASCII.

  2. We had about 460% loose object overhead (160M of cruft against
     35M of live cache) just from tree objects in intermediate
     versions of the cache. While it was not too hard to
     get rid of with a repack and a prune, we are probably better off
     not generating it in the first place. In theory we could have
     written only one notes tree, and kept the intermediate state in
     memory. In practice, flushing once per commit-diff (instead of once
     per file) would probably be fine, and would be simpler to
     implement.
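
The size figures quoted above can be rederived from the two
count-objects outputs and the du listing. This is only a sketch of the
arithmetic. The variable names are mine, and it assumes the small new
pack holds nothing but the cache.

```shell
before_kb=198604   # count-objects right after priming
after_kb=163288    # count-objects after "git repack -d"
pack_kb=2056       # du -k of the small pack created by the repack

# Loose objects that the repack moved into the pack: the live cache.
cache_loose_kb=$((before_kb - after_kb))

# Loose objects left behind (stale), as a percentage of the live cache.
overhead_pct=$((after_kb * 100 / cache_loose_kb))

# How much the live cache shrank once packed (integer division).
shrink=$((cache_loose_kb / pack_kb))

echo "live cache as loose objects: ${cache_loose_kb}K"
echo "stale objects relative to live cache: ${overhead_pct}%"
echo "loose-to-packed ratio: ${shrink}x"
```

That gives 35316K, 462% and 17x, which is where the rounded "35M",
"460%" and "about 2M" come from.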

And of course, now that I have a completely primed cache, I can push it
around with "git push $dest notes/textconv/mfo". Yay for storing notes
as git objects.
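
A throwaway demo of that push, as a sketch only. It uses "git notes"
to create a stand-in ref under refs/notes/textconv/mfo, which is not
how the patch series populates the cache. The temporary directories,
the blob contents and the note text are all made up.

```shell
# Scratch repositories; nothing here touches a real repo.
tmp=$(mktemp -d)
git init -q "$tmp/src"
git init -q --bare "$tmp/dest.git"
cd "$tmp/src"
git config user.name demo
git config user.email demo@example.com

# Stand-in for one cached textconv result: a note attached to a blob.
blob=$(echo 'pretend this is a photo' | git hash-object -w --stdin)
git notes --ref=textconv/mfo add -m 'pretend metadata' "$blob"

# Same form as the command in the mail, with a local path as $dest.
dest="$tmp/dest.git"
git push -q "$dest" notes/textconv/mfo

# Both sides should now name the same notes commit.
git rev-parse refs/notes/textconv/mfo
git --git-dir="$dest" rev-parse refs/notes/textconv/mfo
```

The receiving side does not need the blobs themselves for the ref to
arrive; the notes tree only refers to them by name.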

-Peff