Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

"Martin Langhoff" <martin.langhoff@xxxxxxxxx> · Sat, 9 Dec 2006 14:56:28 +1300

On 12/9/06, Linus Torvalds <torvalds@xxxxxxxx> wrote:
Actually, just looking at the examples, it looks like memcached is
fundamentally flawed, exactly the same way Apache mod_cache is
fundamentally flawed.

I don't know if fundamentally flawed but (having used memcached) I
don't think it's a big win for this at all.

We can make gitweb to detect mod_perl and a few smarter things if it
is running inside of it. In fact, we can (ab)use mod_perl and perl
facilities a bit to do some serialization which will be a big win for
some pages. What we need for that is to set a sensible the ETag and
use some IPC to announce/check if other apache/modperl processes are
preparing content for the same ETag. The first-process-to-announce a
given ETag can then write it to a common temp directory (atomically -
write to a temp-name and move to the expected name) while other
processes wait, polling for the file. Once the file is in place the
latecomers can just serve the content of the file and exit.

(I am calling the "state we are serving" identifier ETag because I
think we should also set it as the ETag in the HTTP headers, so well
be able to check the ETag of future requests for staleness - all we
need is a ref lookup, and if the SHA1 matches, we are sorted). So
having this 'unique request identifier' doubles up nicely...

The ETag should probably be:
- SHA1+displaytype+args for pages that display an object identified by SHA1
- refname+SHA!+displaytype+args for pages that display something
identified by a ref
- SHA1(names and sha1s of all refs) for the summary page

You can't have a cache architecture where the client just does a "get",
like memcached does. You need to have a "read-for-fill" operation, which
says:

You _could_ make do with a convention of polling for "entryname" and
"workingon-entryname" and if "workingon-entryname" is set to 1, you
can expect entryname to be filled real soon now. However, memcached is
completely memorybound, so it is only nice for really small stuff or
for a large server farm which has gobs of spare ram.

(Note that memcached does have timeouts which means that the
'workingon' value could have a short timeout in case the request is
cancelled or the process dies - the nasty bit in the above plan would
be the polling.)

I still don't understand why apache doesn't do it. I guess it wants to be
stateless or something.

Apache doesn't do it because most web applications don't use the HTTP
procol correctly - specially when it comes to the idempotency of GET.
So in 99% of the cases, web apps serve truly different pages for the
same GET request, depending on your cookie, IP address, time-of-day,
etc.

Most websites deal with very little traffic, so this isn't a problem.
And many large sites that serve a lot of traffic from a dynamic web
app want to be serving custom ads, let you login and see your
personalised toolbar, etc,etc, so this wouldn't work for them either.

So in practice, serialising speculatively on GET requests for the same
URL has very little payoff except for static content. And that's quite
fast anyway.... specially if the underlying OS is smokin' fast ;-)

cheers,

martin
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html