On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:

> I haven't hit insurmountable performance problems, even on
> low-end hardware; especially since I started storing blob ids in
> Xapian itself, avoiding the expensive tree lookup via git.

The painful thing is traversing the object graph for clones and
fetches. Bitmaps help, but you still have to generate them.

> The main problem seems to be tree size. Deepening (2/2/36 vs
> 2/38) might be an option (I think Peff brought that up); but it
> might be easier to switch to YYYYMM refs (working like
> logrotate) and rely on Xapian to tie the entire thing together.

Yes, the hashing is definitely one issue. Some numbers here:

  http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@xxxxxxxxxxxxxxxxxxxxx/

If you have C commits on a tree with T entries, you have to do C*T
hash lookups for a flat tree (for each commit, you have to see "yup,
already saw that object"). Sharding that across H entries at the top
level drops the per-commit tree cost from T to H + T/H (actually,
it's a bit worse, because we have to read the secondary tree, too).
Sharding again (at H') gets you H + H' + T/H/H'.

Let's imagine you do one message per commit, so C=T. At 400K
messages, that's about 160 billion hash lookups flat. At H=256, it's
about 700 million. If you shard again with H'=256, it's 200 million.
After that, the additive terms start to dominate, and it's not worth
going any further (and also, we're ignoring the extra-tree cost at
each level).

At that point you're better off starting to have fewer commits. I
know that the schema you use does put useful information into the
commit message, but it's also redundant with what's in the messages
themselves. And it sounds like you push most of that out to Xapian
anyway.
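The cost model above can be checked with a few lines of Python. This
is just a sketch of the arithmetic as stated (per-commit cost T flat,
H + T/H with one fan-out level, H + H' + T/H/H' with two); the
function name and the decision to ignore the extra-tree reads are my
own simplifications:

```python
def lookups(commits, entries, fanouts=()):
    """Approximate hash lookups to walk `commits` commits whose trees
    hold `entries` total entries, sharded across the given fan-out
    levels.  Flat cost per commit is `entries`; each fan-out level h
    divides the remaining tree cost by h and adds h back on top.
    (Extra-tree reads at each level are ignored, as in the text.)"""
    per_commit = entries
    for h in fanouts:
        per_commit /= h
    per_commit += sum(fanouts)
    return commits * per_commit

C = T = 400_000  # one message per commit, 400K messages
print(f"flat:       {lookups(C, T):.2e}")             # ~1.6e11 (160 billion)
print(f"one level:  {lookups(C, T, (256,)):.2e}")     # ~7.3e8  (700 million)
print(f"two levels: {lookups(C, T, (256, 256)):.2e}") # ~2.1e8  (200 million)
```

Those match the 160 billion / 700 million / 200 million figures above,
and make it easy to see why a third level stops paying off: the
additive sum(fanouts) term grows while the divided term is already
tiny.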
Imagine your repo had one commit with 400K historical messages, and
then grouped the new messages so that on average we got about 10
messages per commit (this doesn't seem unrealistic for something that
commits every few minutes; the messages tend to be bunched in time; I
ran some numbers against a 10-minute mark in the earlier message).
Then after another 100K messages, we'd have C=10,001 and T=500K. With
two levels of hashing at 256 each, that's ~5 million hash lookups to
walk the graph.

And those numbers would be reasonable for a hosting site like GitHub.
I don't know what C is for the kernel repo, but I suspect with the
right tuning it could be made large-but-reasonable.

-Peff
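Plugging the batched scenario into the same back-of-the-envelope
model confirms the ~5 million figure (a sketch only; the 10-per-commit
grouping and the two 256-way levels are the assumptions stated above):

```python
# Batched scenario: one big initial commit holding 400K messages,
# then 100K new messages grouped ~10 per commit, with two 256-way
# fan-out levels in the tree.
H = 256
C = 1 + 100_000 // 10          # 10,001 commits
T = 500_000                    # 500K tree entries after the new messages
per_commit = H + H + T / (H * H)   # H + H' + T/H/H' from the model above
total = C * per_commit
print(f"{total:.2e}")          # ~5.2e6, i.e. ~5 million hash lookups
```

Almost all of that cost is the additive H + H' = 512 lookups per
commit; the divided term T/H/H' is under 8, so fewer commits is the
only remaining lever.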