>>> I posted separately about those. And I've been mulling about whether
>>> the thundering herd is really such a big problem that we need to
>>> address it head-on.
>>
>> Uhm... yes it is.
>
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> is not a high-payoff task, as well as describing how to tackle it.
>
> To recap, the reasons it is not high payoff is that:
>
> - the main benefit comes from being cacheable and able to revalidate
>   the cache cheaply (with the ETags-based strategy discussed above)
> - highly distributed caches/proxies means we'll seldom see a true
>   cold cache situation
> - we have a huge set of URLs which are seldom hit, and will never see
>   a thundering anything
> - we have a tiny set of very popular URLs that are the key target for
>   the thundering herd (projects page, summary page, shortlog, fulllog)
> - but those are in the clear as soon as the caches are populated
>
> Why do we have to take it head-on? :-)

I think I agree with you, but not as strongly. Certainly, having any
kind of effective caching (heck, just comparing the timestamp of the
relevant ref(s) with the If-Modified-Since: header) will help
kernel.org enormously. But as soon as there's a push, particularly a
release push, it invalidates *all* of the popular pages *and* the
thundering herd arrives. The result is that all of the popular
"what's new?" summary pages get fetched 15 times in parallel and,
because the front end doesn't serialize them, populating the caches
can be a painful process involving a lot of repeated work.

I tend to agree that for the basic project summary pages, generating
them preemptively as static pages out of the push script seems best.
("find /usr/src/linux -type d -print | wc -l" is 1492. Dear me. Oh!
There is no per-directory shortlog page; that simplifies things. But
there *should* be.) The only tricky thing is the "n minutes/hours/days
ago" timestamps. Basically, you want to generate a half-formatted,
indefinitely-cacheable page that contains them as absolute timestamps,
and have a system for regenerating the fully-formatted page from that
(and the current time); the second sketch at the end of this mail
shows what that regeneration pass might look like.

The ideas that people have been posting seem excellent. Give a page
two timeouts. If a GET arrives before the first timeout, and no
prerequisites have changed, it's served directly from cache. If it
arrives after the second timeout, or the prerequisites have changed,
it blocks until the page is regenerated. But if it arrives between
those two times, it serves the stale data and starts generating fresh
data in the background. (There's a rough sketch of this below.)

So for the fully-formatted timestamps, the first timeout is when the
next human-readable timestamp on the page ticks over. But the second
timeout can be past that by, say, 5% of the timeout value. It's okay
to display "3 hours ago" until 12 minutes past the 4-hour mark.

It might be okay to allow even the prerequisites to be slightly stale
when serving old data; it's okay if it takes 30 seconds for the
kernel.org web page to notice that Linus pushed. But on my office
gitweb, I'm not sure that it's okay to take 30 seconds to notice that
*I* just pushed. (I'm also not sure about consistency issues. If I
follow a link from one page that shows the new release to another
that hasn't caught up yet, it would be a bit disconcerting for the
release to disappear.)

The nasty problem with built-in caching is that you need a whole cache
reclaim infrastructure; it would be so much nicer to let Squid deal
with that whole mess.
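To make the two-timeout scheme above concrete, here is a minimal
sketch of the bookkeeping a front end would need. It's Python rather
than gitweb's Perl purely for brevity, and the names (TwoTimeoutCache,
soft_ttl, hard_ttl, the regenerate() callback) are invented for the
example; a real front end would keep one of these per URL, which is
exactly the reclaim infrastructure I was just grumbling about.

import threading
import time


class TwoTimeoutCache:
    """One cached page with a soft ("stale") and a hard ("unusable") timeout."""

    def __init__(self, regenerate, soft_ttl, hard_ttl):
        self.regenerate = regenerate      # callback that rebuilds the page
        self.soft_ttl = soft_ttl          # first timeout, in seconds
        self.hard_ttl = hard_ttl          # second timeout, in seconds
        self.state_lock = threading.Lock()
        self.regen_lock = threading.Lock()
        self.page = None
        self.generated_at = 0.0
        self.refresh_running = False

    def get(self, prerequisites_changed=False):
        now = time.time()
        with self.state_lock:
            usable = self.page is not None and not prerequisites_changed
            age = now - self.generated_at
            if usable and age < self.soft_ttl:
                # Before the first timeout: serve straight from cache.
                return self.page
            if usable and age < self.hard_ttl:
                # Between the two timeouts: serve the stale copy, and kick
                # off at most one background regeneration; no herd here.
                if not self.refresh_running:
                    self.refresh_running = True
                    threading.Thread(target=self._refresh,
                                     args=(self.generated_at,),
                                     daemon=True).start()
                return self.page
            observed = self.generated_at
        # Past the second timeout, or a prerequisite changed: block until
        # the page has been regenerated.
        return self._refresh(observed)

    def _refresh(self, observed_generated_at):
        with self.regen_lock:             # serialize regeneration
            with self.state_lock:
                if self.generated_at > observed_generated_at:
                    # Someone else regenerated while we waited for the lock.
                    self.refresh_running = False
                    return self.page
            page = self.regenerate()
            with self.state_lock:
                self.page = page
                self.generated_at = time.time()
                self.refresh_running = False
            return page

Note that blocked requests queue on regen_lock, so even a cold cache
is populated by one worker while the rest wait and then reuse its
result, which is the serialization the front end currently lacks.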
But Squid can't deal with anything other than fully rendered HTML.
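And for what it's worth, a sketch of the other half: regenerating the
fully-formatted page from the half-formatted one. Again Python for
brevity, and the <!--RELDATE 1234567890--> placeholder is something I
just made up for the example, not anything gitweb emits. The idea is
that the expensive, cacheable pass writes absolute Unix timestamps
into the page, and this cheap pass fills in the "n hours ago" strings
and reports when the earliest of them next ticks over, which is
exactly the first timeout described above.

import re
import time

# Hypothetical placeholder for the half-formatted page; gitweb emits
# nothing like this today, it is purely for illustration.
RELDATE = re.compile(r'<!--RELDATE (\d+)-->')

UNITS = [("day", 86400), ("hour", 3600), ("min", 60), ("sec", 1)]


def humanize(age):
    """Render an age in seconds as a gitweb-style "N units ago" string."""
    for name, size in UNITS:
        if age >= size:
            n = age // size
            return "%d %s%s ago" % (n, name, "s" if n != 1 else "")
    return "right now"


def render(half_formatted, now=None):
    """Expand RELDATE placeholders into relative dates.

    Returns (html, first_timeout): the fully-formatted page plus the
    moment at which the earliest human-readable timestamp ticks over.
    """
    now = int(time.time()) if now is None else int(now)
    next_tick = [float("inf")]

    def expand(match):
        then = int(match.group(1))
        age = max(now - then, 0)
        for name, size in UNITS:
            if age >= size:
                # "N <unit>s ago" changes once the age reaches N+1 units.
                next_tick[0] = min(next_tick[0], then + (age // size + 1) * size)
                break
        else:
            next_tick[0] = min(next_tick[0], then + 1)
        return humanize(age)

    html = RELDATE.sub(expand, half_formatted)
    return html, next_tick[0]

The front end would then use first_timeout as the soft timeout and
push the hard timeout a few percent past it, as described above.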