On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:

> I haven't hit insurmountable performance problems, even on
> low-end hardware; especially since I started storing blob ids in
> Xapian itself, avoiding the expensive tree lookup via git.

The painful thing is traversing the object graph for clones and
fetches. Bitmaps help, but you still have to generate them.

> The main problem seems to be tree size. Deepening (2/2/36 vs
> 2/38) might be an option (I think Peff brought that up); but it
> might be easier to switch to YYYYMM refs (working like
> logrotate) and rely on Xapian to tie the entire thing together.

Yes, the hashing is definitely one issue. Some numbers here:

  http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@xxxxxxxxxxxxxxxxxxxxx/

If you have C commits on a tree with T entries, you have to do C*T
hash lookups for a flat tree (for each commit, you have to see "yup,
already saw that object"). Sharding that across H entries at the top
level drops the per-commit tree cost from T to H + T/H (actually,
it's a bit worse, because we have to read the secondary tree, too).
Sharding again (at H') gets you H + H' + T/H/H'.

Let's imagine you do one message per commit, so C=T. At 400K
messages, that's about 160 billion hash lookups flat. At H=256, it's
about 700 million. If you shard again with H'=256, it's 200 million.
After that, the additive terms start to dominate, and it's not worth
going any further (and also, we're ignoring the extra-tree cost at
each level).

At that point you're better off starting to have fewer commits. I
know that the schema you use does put useful information into the
commit message, but it's also redundant with what's in the messages
themselves. And it sounds like you push most of that out to Xapian
anyway.
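The cost model above can be checked with a few lines of Python. This
is just a sketch of the arithmetic as stated (per-commit cost T flat,
H + T/H with one fan-out level, H + H' + T/H/H' with two); the
function name and the decision to ignore the extra-tree reads are my
own simplifications:

```python
def lookups(commits, entries, fanouts=()):
    """Approximate hash lookups to walk `commits` commits whose trees
    hold `entries` total entries, sharded across the given fan-out
    levels.  Flat cost per commit is `entries`; each fan-out level h
    divides the remaining tree cost by h and adds h back on top.
    (Extra-tree reads at each level are ignored, as in the text.)"""
    per_commit = entries
    for h in fanouts:
        per_commit /= h
    per_commit += sum(fanouts)
    return commits * per_commit

C = T = 400_000  # one message per commit, 400K messages
print(f"flat:       {lookups(C, T):.2e}")             # ~1.6e11 (160 billion)
print(f"one level:  {lookups(C, T, (256,)):.2e}")     # ~7.3e8  (700 million)
print(f"two levels: {lookups(C, T, (256, 256)):.2e}") # ~2.1e8  (200 million)
```

Those match the 160 billion / 700 million / 200 million figures above,
and make it easy to see why a third level stops paying off: the
additive sum(fanouts) term grows while the divided term is already
tiny.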
Imagine your repo had one commit with 400K historical messages, and
then grouped the new messages so that on average we got about 10
messages per commit (this doesn't seem unrealistic for something that
commits every few minutes; the messages tend to be bunched in time; I
ran some numbers against a 10-minute mark in the earlier message).
Then after another 100K messages, we'd have C=10,001 and T=500K. With
two levels of hashing at 256 each, that's ~5 million hash lookups to
walk the graph.

And those numbers would be reasonable for a hosting site like GitHub.
I don't know what C is for the kernel repo, but I suspect with the
right tuning it could be made large-but-reasonable.

-Peff
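Plugging the batched scenario into the same back-of-the-envelope
model confirms the ~5 million figure (a sketch only; the 10-per-commit
grouping and the two 256-way levels are the assumptions stated above):

```python
# Batched scenario: one big initial commit holding 400K messages,
# then 100K new messages grouped ~10 per commit, with two 256-way
# fan-out levels in the tree.
H = 256
C = 1 + 100_000 // 10          # 10,001 commits
T = 500_000                    # 500K tree entries after the new messages
per_commit = H + H + T / (H * H)   # H + H' + T/H/H' from the model above
total = C * per_commit
print(f"{total:.2e}")          # ~5.2e6, i.e. ~5 million hash lookups
```

Almost all of that cost is the additive H + H' = 512 lookups per
commit; the divided term T/H/H' is under 8, so fewer commits is the
only remaining lever.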