Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

On Sun, 10 Dec 2006, Jakub Narebski wrote:
> >> If-Modified-Since:, If-Match:, If-None-Match: do you?
> 
> And in the CGI standard there is a way to access additional HTTP header
> info from a CGI script: the environment variables are named HTTP_*.
> For example, if the browser sent an If-Modified-Since: header, its value
> can be found in the HTTP_IF_MODIFIED_SINCE environment variable.
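
For concreteness, here's a minimal CGI sketch in Python (gitweb itself is
Perl) showing that mapping; the "description" resource and the mtime
comparison are hypothetical stand-ins for real validation logic:

    #!/usr/bin/env python3
    # Minimal CGI sketch: per the CGI spec, the web server exposes each
    # request header as an HTTP_* environment variable.
    import os
    import sys
    from email.utils import parsedate_to_datetime

    # "If-Modified-Since: <date>" shows up exactly as described above:
    ims = os.environ.get("HTTP_IF_MODIFIED_SINCE")

    resource = "description"          # hypothetical file backing this page
    mtime = os.path.getmtime(resource)

    if ims and parsedate_to_datetime(ims).timestamp() >= mtime:
        # The client's copy is still fresh: answer 304 with no body.
        sys.stdout.write("Status: 304 Not Modified\r\n\r\n")
    else:
        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        with open(resource) as f:
            sys.stdout.write(f.read())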

Guys, you're missing something fairly fundamental.

Supporting client-side caching with all that fancy "If-Modified-Since:" 
crap helps almost _nothing_.

That's not the _problem_.

It's usually not one client asking for the gitweb pages: the load comes 
simply from lots of people independently asking for them. So client-side 
caching may help a tiny tiny bit, but it's not actually fixing the 
fundamental problem at all.

So forget about "If-Modified-Since:" etc. It may help in benchmarks when 
you try it yourself and hit "refresh" on the client side. But the basic 
problem is all about the many clients that do NOT have things cached, 
because all the client caches are filled up with pr0n, not with gitweb 
data from yesterday.

So the thing that helps is server-side caching with good access patterns, 
so that the server won't have to seek all over the disk when clients that 
_don't_ have things in their caches want to see the "git projects" summary 
overview (which currently lists something like 200+ projects).

So to get that list of 200+ projects, right now gitweb will literally walk 
them all, looking at their refs, their descriptions, and their ages (which 
requires looking up the refs, and the objects behind the refs). If they 
aren't cached, you're going to have several disk seeks for each project.
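
In rough code form, the per-project work looks something like this sketch
(Python for illustration - gitweb is Perl, and PROJECT_ROOT here is made
up); each step touches a different file or object, hence a potential seek
on a cold cache:

    #!/usr/bin/env python3
    # Illustrative sketch of the per-project walk behind the summary page.
    import os
    import subprocess

    PROJECT_ROOT = "/pub/scm"         # hypothetical repository root

    def summarize(repo):
        path = os.path.join(PROJECT_ROOT, repo)
        # Read the human-readable description file: one seek.
        with open(os.path.join(path, "description")) as f:
            desc = f.read().strip()
        # Resolve HEAD, read the ref, then the commit object behind it
        # to compute the age column: several more seeks.
        age = subprocess.check_output(
            ["git", "--git-dir", path, "log", "-1", "--pretty=format:%cr"],
            text=True)
        return repo, desc, age

    # 200+ iterations of the above, mostly uncached on a cold server:
    for repo in sorted(os.listdir(PROJECT_ROOT)):
        print(summarize(repo))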

At 200+ projects, the thing that makes it slow is those disk seeks. Even 
with a fast disk and a RAID array, the seeks are all basically going to be 
interdependent, so there's no room for disk-arm movement optimization, and 
even in the absence of any other load it's still going to be several 
seconds just for the seeks: say 10ms per seek and four or five seeks per 
project, and you've got 10 seconds _just_ for the seeks to generate the 
top-level summary page (and quite frankly, five seeks is probably 
optimistic).
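
The back-of-the-envelope math, using those assumed numbers:

    # 200 projects x 5 seeks/project x 10ms/seek
    projects, seeks_per_project, seek_ms = 200, 5, 10
    print(projects * seeks_per_project * seek_ms / 1000)  # -> 10.0 seconds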

Now, hopefully some of it will be in the disk cache, but when the 
mirroring happens, it will basically blow the disk caches away totally 
(when using the "--checksum" option), and then it will literally take 
tens of seconds to generate that one top-level page.

And when mirroring is blowing out the disk caches, the thing will be doing 
other things _too_ to the disk, of course.

So what you want is server-side caching, and you basically _never_ want to 
re-generate that data synchronously (because even if the server can take 
the load, having the clients wait for half a minute or more for the data 
is just NOT FRIENDLY). This is why I suggested the grace period where we 
fill the cache on the server side in the background _while_at_the_same_time_ 
actually feeding the clients the old cached contents.
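
A minimal sketch of that grace-period scheme - my Python approximation,
not actual gitweb code, with the cache held in memory for brevity:

    import threading
    import time

    CACHE = {}           # page key -> (generated_at, content)
    MAX_AGE = 300        # serve content up to five minutes old
    REFRESHING = set()   # pages currently being regenerated
    LOCK = threading.Lock()

    def get_page(key, regenerate):
        now = time.time()
        with LOCK:
            entry = CACHE.get(key)
            if (entry is not None and now - entry[0] > MAX_AGE
                    and key not in REFRESHING):
                # Stale: regenerate in the background, but do NOT make
                # this client wait for the expensive, seek-heavy walk.
                REFRESHING.add(key)
                threading.Thread(target=_refresh,
                                 args=(key, regenerate)).start()
        if entry is not None:
            return entry[1]          # old-but-instant cached content
        # Cold cache: the very first request has no choice but to wait.
        return _refresh(key, regenerate)

    def _refresh(key, regenerate):
        content = regenerate()       # e.g. the 200+ project summary walk
        with LOCK:
            CACHE[key] = (time.time(), content)
            REFRESHING.discard(key)
        return content

Every client after the very first one gets an immediate answer, and that
answer is never more than the grace period plus one regeneration behind.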

Because what matters most to _clients_ is not getting data that is 
up-to-date to within the last few minutes - people who go to the 
overview page just want a list of projects, and they want it in a second 
or two, not half a minute later.

And btw, all those "If-Modified-Since:" things are irrelevant, since quite 
often, the top-level page really technically _has_ been modified in the 
last few minutes, because with the kernel and git projects, _somebody_ has 
usually pushed out one of the projects within the last hour.

And no, people don't just sit there refreshing their browser page all the 
time. I bet even "active" git users do it at most once or twice a day, 
which means that their client cache will _never_ be up-to-date.

But if you do it with server-side caches and grace-periods, you can 
generally say "we have something that is at most five minutes old", and 
most importantly, you can hopefully do it without a lot of disk seeks 
(because you just cache the _one_ page as _one_ object), so hopefully you 
can do it in a few hundred ms even if the thing is on disk and even if 
there's a lot of other load going on.
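
And storing each rendered page as a single file makes the cheap-serving
part almost trivial. A hypothetical sketch (CACHE_DIR and the key scheme
are made up for illustration):

    import os
    import tempfile

    CACHE_DIR = "/var/cache/gitweb"   # hypothetical location

    def store_page(key, content):
        # Write-then-rename so readers never see a half-written page.
        fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.rename(tmp, os.path.join(CACHE_DIR, key))

    def serve_page(key):
        # One open and one mostly-sequential read: no seek storm.
        with open(os.path.join(CACHE_DIR, key)) as f:
            return f.read()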

I bet the top-level "all projects" summary page and the individual 
project summary pages are the important things to cache. That's probably 
what most people look at, and those pages have lots of server-side cache 
locality. Individual commits and diffs probably don't get the same kind 
of "lots of people looking at them" traffic, and thus don't get the same 
kind of benefit from caching.

(Individual commits hopefully also need fewer disk seeks, at least with 
packed repositories. So even if you have to re-generate them from scratch, 
the seek times themselves won't take up tens of seconds, unless the 
project is entirely unpacked and diffing just generates total disk-seek 
hell.)

		Linus