Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`

Eric Wong <e@xxxxxxxxx> · Tue, 7 Feb 2017 00:14:46 +0000

Jeff King <peff@xxxxxxxx> wrote:
> On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:
> 
> > I haven't hit insurmountable performance problems, even on
> > low-end hardware; especially since I started storing blob ids in
> > Xapian itself, avoiding the expensive tree lookup via git.
> 
> The painful thing is traversing the object graph for clones and fetches.
> Bitmaps help, but you still have to generate them.

Yep.  "public-inbox-init" defaults to enabling bitmaps in the
config for this reason.

> > The main problem seems to be tree size.  Deepening (2/2/36 vs
> > 2/38) might be an option (I think Peff brought that up); but it
> > might be easier to switch to YYYYMM refs (working like
> > logrotate) and rely on Xapian to tie the entire thing together.
> 
> Yes, the hashing is definitely one issue. Some numbers here:
> 
>   http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@xxxxxxxxxxxxxxxxxxxxx/
> 
> If you have C commits on a tree with T entries, you have to do C*T hash
> lookups for a flat tree (for each commit, you have to see "yup, already
> saw that object"). Sharding that across H entries at the top level drops
> the tree cost from T to H + T/H (actually, it's a bit worse because we
> have to read the secondary tree, too). Sharding again (at H') gets you
> H + H' + T/H/H'.
> 
> Let's imagine you do one message per commit, so C=T. At 400K messages,
> that's about 160 billion hash lookups flat. At H=256, it's about 700
> million. If you shard again with H'=256, it's 200 million. After that,
> the additive terms start to dominate, and it's not worth going any
> further (and also, we're ignoring the extra-tree cost to each level).

Just to make sure I'm following here, the full formulas are:

	C * (H + H' + (T / H / H'))   # 2/2/36
	C * (H + (T / H))             # 2/38 (current)

Right?
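
A rough sketch for checking those against the numbers quoted above,
assuming the whole per-commit tree cost is multiplied by C and the
extra reads of the intermediate trees are ignored.  The function name
and arguments are made up for illustration; nothing here is git or
public-inbox code.

```python
def hash_lookups(commits, entries, fanouts=()):
    """Estimate hash lookups to walk `commits` commits over `entries` blobs.

    `fanouts` lists the entry count of each sharding level, e.g. (256,)
    for 2/38 or (256, 256) for 2/2/36.  An empty tuple means a flat tree.
    """
    per_commit = float(entries)
    for h in fanouts:
        per_commit /= h          # each level shrinks the leaf tree
    per_commit += sum(fanouts)   # plus one pass over each upper level
    return commits * per_commit


n = 400_000  # one message per commit, so C == T
flat = hash_lookups(n, n)
one_level = hash_lookups(n, n, (256,))
two_level = hash_lookups(n, n, (256, 256))
squashed = hash_lookups(10_001, 500_000, (256, 256))

for label, value in [("flat", flat), ("2/38", one_level),
                     ("2/2/36", two_level), ("squashed", squashed)]:
    print(f"{label:10s} {value / 1e6:12.1f} million")
```

With those inputs it should land near the quoted figures: roughly
160 billion flat, 700 million at H=256, 200 million with H'=256, and
about 5 million for the squashed-history case.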

> At that point you're better off to start having fewer commits. I know
> that the schema you use does put useful information into the commit
> message, but it's also redundant with what's in the messages themselves.
> And it sounds like you push most of that out to Xapian anyway.

Yeah, there's no benefit to Xapian users for having any info in
the commit.  However, keeping commit-per-message is still
important to me for better robustness against hardware and
network failures.

But yes, historical stuff could be squashed into a single commit
(much like how linux.git started with v2.6.12-rc2 without
history).  Perhaps some folks will care about NNTP article
numbering being non-chronological...

> Imagine your repo had one commit with 400K historical messages, and then
> grouped the new messages so that on average we got about 10 messages per
> commit (this doesn't seem unrealistic for something that commits every
> few minutes; the messages tend to be bunched in time; I ran some
> numbers against a 10-minute mark in the earlier message).
> 
> Then after another 100K messages, we'd have C=10,001 and T=500K. With
> two levels of hashing at 256 each, that's ~5 million hash lookups to
> walk the graph. And those numbers would be reasonable for a hosting site
> like GitHub.
> 
> I don't know what C is for the kernel repo, but I suspect with the right
> tuning it could be made into large-but-reasonable.

LKML probably has an upper bound of 30K messages per month;
so it could hit 100K in less than 4 months.  Worst case might
be 360K messages a year:

	360000 * (256 + 256 + ((360000 + old) / 256 / 256))

That's still at least 180 million hash lookups after a year or
so of real-time updates; right?  (But probably closer to 240
million if there are 10 million old messages in there.)

Instead, I think I will add an option to support logrotate-style
monthly heads (YYYYMM); keeping 2/38 and C == T:

	30000 * (256 + (30000 / 256))               => 11 million
	30000 * (256 + 256 + (30000 / 256 / 256))   => 15 million
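
For anyone who wants to redo the arithmetic, here is a small sketch
comparing a single head after a year with one monthly head.  It
assumes the 30K/month upper bound guessed above and a fan-out of 256
per level; the helper name is hypothetical and the model ignores the
cost of reading the intermediate trees.

```python
MONTHLY = 30_000        # assumed LKML upper bound, messages per month
YEARLY = 12 * MONTHLY   # 360K messages
H = 256                 # entries per two-hex-digit level


def lookups(commits, entries, levels):
    """commits * (levels * H + entries / H**levels)"""
    return commits * (levels * H + entries / H ** levels)


# Single head, 2/2/36, one commit per message for a year.
year_fresh = lookups(YEARLY, YEARLY, 2)
year_with_old = lookups(YEARLY, YEARLY + 10_000_000, 2)

# One monthly head with C == T == 30K.
month_2_38 = lookups(MONTHLY, MONTHLY, 1)
month_2_2_36 = lookups(MONTHLY, MONTHLY, 2)

for label, value in [("year, no old messages", year_fresh),
                     ("year, 10M old messages", year_with_old),
                     ("month, 2/38", month_2_38),
                     ("month, 2/2/36", month_2_2_36)]:
    print(f"{label:24s} {value / 1e6:7.1f} million")
```

At this size the additive 256 terms dominate, which is why 2/38 comes
out cheaper than 2/2/36 within a single monthly head.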

The monthly heads would each be discontiguous history-wise;
so Xapian would become a requirement for users of this option
for Message-ID lookups, but histories would still be readable
with "git log".
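
To make the layout concrete, a minimal sketch of how a message might
be mapped to a monthly head and a 2/38 path.  Everything here is an
assumption for illustration: the "refs/heads/YYYYMM" naming, picking
the month from a UTC timestamp, and deriving the path from a SHA-1
hex digest of the Message-ID.  None of it is taken from the
public-inbox code.

```python
import hashlib
from datetime import datetime, timezone


def monthly_head(when):
    """Return a logrotate-style ref name for the month containing `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC so
    that the month boundary does not depend on the server's local time.
    """
    return when.astimezone(timezone.utc).strftime("refs/heads/%Y%m")


def path_2_38(message_id):
    """Split a 40-character hex digest of the Message-ID into 2/38."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


# Hypothetical message arriving at the time this mail was sent.
when = datetime(2017, 2, 7, 0, 14, 46, tzinfo=timezone.utc)
head = monthly_head(when)
path = path_2_38("example@localhost")
print(head, path)
```

Since the same path could then exist under several heads, nothing in
git alone says which head holds a given Message-ID; that is the gap
Xapian would have to fill.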

One good side-effect of using monthly heads is --single-branch
clones may be used if someone lacks the bandwidth or space to do
a full mirror.  I'm not sure if the server side (pack reuse,
bitmaps) will benefit in any way other than bandwidth reductions,
though.

A (far-fetched) option I've considered would be to store entire
messages in the commit and have no trees or blobs at all.  But
that would require a significant rework, and would also make
Xapian a hard requirement for even checking if a message is
deleted or not.