>>> I posted separately about those. And I've been mulling about whether
>>> the thundering herd is really such a big problem that we need to
>>> address it head-on.
>>
>> Uhm... yes it is.
>
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> is not a high-payoff task, as well as describing how to tackle it.
>
> To recap, the reasons it is not high payoff is that:
>
> - the main benefit comes from being cacheable and able to revalidate
>   the cache cheaply (with the ETags-based strategy discussed above)
> - highly distributed caches/proxies means we'll seldom see a true
>   cold cache situation
> - we have a huge set of URLs which are seldom hit, and will never see
>   a thundering anything
> - we have a tiny set of very popular URLs that are the key target for
>   the thundering herd (projects page, summary page, shortlog, fulllog)
> - but those are in the clear as soon as the caches are populated
>
> Why do we have to take it head-on? :-)

I think I agree with you, but not as strongly. Certainly, having any
kind of effective caching (heck, just comparing the timestamp of the
relevant ref(s) with the If-Modified-Since: header) will help
kernel.org enormously. But as soon as there's a push, particularly a
release push, it invalidates *all* of the popular pages *and* the
thundering herd arrives. The result is that all of the popular
"what's new?" summary pages get fetched 15 times in parallel and,
because the front end doesn't serialize them, populating the caches
can be a painful process involving a lot of repeated work.

I tend to agree that for the basic project summary pages, generating
them preemptively as static pages out of the push script seems best.
("find /usr/src/linux -type d -print | wc -l" is 1492. Dear me. Oh!
There is no per-directory shortlog page; that simplifies things. But
there *should* be.) The only tricky thing is the "n minutes/hours/days
ago" timestamps. Basically, you want to generate a half-formatted,
indefinitely-cacheable page that contains them as absolute timestamps,
and have a system for regenerating the fully-formatted page from that
(and the current time); the second sketch at the end of this mail
shows what that regeneration pass might look like.

The ideas that people have been posting seem excellent. Give a page
two timeouts. If a GET arrives before the first timeout, and no
prerequisites have changed, it's served directly from cache. If it
arrives after the second timeout, or the prerequisites have changed,
it blocks until the page is regenerated. But if it arrives between
those two times, it serves the stale data and starts generating fresh
data in the background. (There's a rough sketch of this below.)

So for the fully-formatted timestamps, the first timeout is when the
next human-readable timestamp on the page ticks over. But the second
timeout can be past that by, say, 5% of the timeout value. It's okay
to display "3 hours ago" until 12 minutes past the 4-hour mark.

It might be okay to allow even the prerequisites to be slightly stale
when serving old data; it's okay if it takes 30 seconds for the
kernel.org web page to notice that Linus pushed. But on my office
gitweb, I'm not sure that it's okay to take 30 seconds to notice that
*I* just pushed. (I'm also not sure about consistency issues. If I
follow a link from one page that shows the new release to another
that hasn't caught up yet, it would be a bit disconcerting for the
release to disappear.)

The nasty problem with built-in caching is that you need a whole cache
reclaim infrastructure; it would be so much nicer to let Squid deal
with that whole mess.
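To make the two-timeout scheme above concrete, here is a minimal
sketch of the bookkeeping a front end would need. It's Python rather
than gitweb's Perl purely for brevity, and the names (TwoTimeoutCache,
soft_ttl, hard_ttl, the regenerate() callback) are invented for the
example; a real front end would keep one of these per URL, which is
exactly the reclaim infrastructure I was just grumbling about.

import threading
import time


class TwoTimeoutCache:
    """One cached page with a soft ("stale") and a hard ("unusable") timeout."""

    def __init__(self, regenerate, soft_ttl, hard_ttl):
        self.regenerate = regenerate      # callback that rebuilds the page
        self.soft_ttl = soft_ttl          # first timeout, in seconds
        self.hard_ttl = hard_ttl          # second timeout, in seconds
        self.state_lock = threading.Lock()
        self.regen_lock = threading.Lock()
        self.page = None
        self.generated_at = 0.0
        self.refresh_running = False

    def get(self, prerequisites_changed=False):
        now = time.time()
        with self.state_lock:
            usable = self.page is not None and not prerequisites_changed
            age = now - self.generated_at
            if usable and age < self.soft_ttl:
                # Before the first timeout: serve straight from cache.
                return self.page
            if usable and age < self.hard_ttl:
                # Between the two timeouts: serve the stale copy, and kick
                # off at most one background regeneration; no herd here.
                if not self.refresh_running:
                    self.refresh_running = True
                    threading.Thread(target=self._refresh,
                                     args=(self.generated_at,),
                                     daemon=True).start()
                return self.page
            observed = self.generated_at
        # Past the second timeout, or a prerequisite changed: block until
        # the page has been regenerated.
        return self._refresh(observed)

    def _refresh(self, observed_generated_at):
        with self.regen_lock:             # serialize regeneration
            with self.state_lock:
                if self.generated_at > observed_generated_at:
                    # Someone else regenerated while we waited for the lock.
                    self.refresh_running = False
                    return self.page
            page = self.regenerate()
            with self.state_lock:
                self.page = page
                self.generated_at = time.time()
                self.refresh_running = False
            return page

Note that blocked requests queue on regen_lock, so even a cold cache
is populated by one worker while the rest wait and then reuse its
result, which is the serialization the front end currently lacks.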
But Squid can't deal with anything other than fully rendered HTML.
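And for what it's worth, a sketch of the other half: regenerating the
fully-formatted page from the half-formatted one. Again Python for
brevity, and the <!--RELDATE 1234567890--> placeholder is something I
just made up for the example, not anything gitweb emits. The idea is
that the expensive, cacheable pass writes absolute Unix timestamps
into the page, and this cheap pass fills in the "n hours ago" strings
and reports when the earliest of them next ticks over, which is
exactly the first timeout described above.

import re
import time

# Hypothetical placeholder for the half-formatted page; gitweb emits
# nothing like this today, it is purely for illustration.
RELDATE = re.compile(r'<!--RELDATE (\d+)-->')

UNITS = [("day", 86400), ("hour", 3600), ("min", 60), ("sec", 1)]


def humanize(age):
    """Render an age in seconds as a gitweb-style "N units ago" string."""
    for name, size in UNITS:
        if age >= size:
            n = age // size
            return "%d %s%s ago" % (n, name, "s" if n != 1 else "")
    return "right now"


def render(half_formatted, now=None):
    """Expand RELDATE placeholders into relative dates.

    Returns (html, first_timeout): the fully-formatted page plus the
    moment at which the earliest human-readable timestamp ticks over.
    """
    now = int(time.time()) if now is None else int(now)
    next_tick = [float("inf")]

    def expand(match):
        then = int(match.group(1))
        age = max(now - then, 0)
        for name, size in UNITS:
            if age >= size:
                # "N <unit>s ago" changes once the age reaches N+1 units.
                next_tick[0] = min(next_tick[0], then + (age // size + 1) * size)
                break
        else:
            next_tick[0] = min(next_tick[0], then + 1)
        return humanize(age)

    html = RELDATE.sub(expand, half_formatted)
    return html, next_tick[0]

The front end would then use first_timeout as the soft timeout and
push the hard timeout a few percent past it, as described above.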