Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 8 Dec 2006 14:38:24 +0100

On Friday, 8 December 2006 13:57, Rogan Dawes wrote:
> H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>> Unfortunately, the most common queries are also extremely expensive.

With newer gitweb, which tries to do the same using fewer git commands,
some of the queries (summary, heads, tags pages) should be less expensive.

>>> Do you have a top-ten of queries ?  That would be the ones to optimize
>>> for.
>> 
>> The front page, summary page of each project, and the RSS feed for each 
>> project.
> 
> How about extending gitweb to check to see if there already exists a 
> cached version of these pages, before recreating them?
> 
> e.g. structure the temp dir in such a way that each project has a place 
> for cached pages. Then, before performing expensive operations, check to 
> see if a file corresponding to the requested page already exists. If it 
> does, simply return the contents of the file, otherwise go ahead and 
> create the page dynamically, and return it to the user. Do not create 
> cached pages in gitweb dynamically.

This would add the need for a directory for temporary files... well,
at least it would be optional...
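
A minimal sketch of that lookup, in shell for illustration only (gitweb
itself is Perl). The names serve_page and generate_page and the
$git_temp/$project/$action layout are assumptions, not existing gitweb code:

```shell
# Hypothetical sketch: serve a cached page if one exists, else generate it.
# Assumed cache layout: $git_temp/$project/$action
serve_page() {
    git_temp=$1 project=$2 action=$3
    cached="$git_temp/$project/$action"
    if [ -f "$cached" ]; then
        # cache hit: return the stored page as-is
        cat "$cached"
    else
        # cache miss: fall back to the usual dynamic generation;
        # generate_page stands in for gitweb's normal code path
        generate_page "$project" "$action"
    fi
}
```

If no cache directory is configured, only the second branch is ever taken,
which is what keeps the feature optional.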

> Then, in a post-update hook, for each of the expensive pages, invoke 
> something like:
> 
> # delete the cached copy of the file, to force gitweb to recreate it
> rm -f $git_temp/$project/rss
> # get gitweb to recreate the page appropriately
> # use a tmp file to prevent gitweb from getting confused
> wget -O $git_temp/$project/rss.tmp \
>    "http://kernel.org/gitweb.cgi?p=$project;a=rss"
> # move the tmp file into place
> mv $git_temp/$project/rss.tmp $git_temp/$project/rss

Good idea... although there are some page views which shouldn't change
at all... well, with the possible exception of changes in gitweb output,
and even then there are some (blob_plain and snapshot views) which
don't change at all.

It would be good to avoid removing them on push, and only expire
them with some tmpwatch-like cleanup.
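
Something along these lines, as a rough sketch. The function names, the
list of views and the flat cache directory are assumptions for
illustration; it uses modification time where tmpwatch would by default
look at access time:

```shell
# Hypothetical sketch; directory layout and names are assumptions.

# Views whose output changes on push: drop them so they get regenerated.
invalidate_on_push() {
    cache_dir=$1   # e.g. $git_temp/$project
    for view in summary heads tags rss; do
        rm -f "$cache_dir/$view"
    done
}

# Everything else (e.g. cached blob_plain and snapshot output) is left
# alone on push and only removed by age, tmpwatch-style.
expire_old() {
    cache_dir=$1 days=$2
    # remove regular files last modified more than $days days ago
    find "$cache_dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

The first function would be called from the post-update hook, the second
from cron, e.g. `expire_old "$git_temp/$project" 7`.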

> This way, we get the exact output returned from the usual gitweb 
> invocation, but we can now cache the result, and only update it when 
> there is a new commit that would affect the page output.
> 
> This would also not affect those who do not wish to use this mechanism. 
> If the file does not exist, gitweb.cgi will simply revert to its usual 
> behaviour.

Good idea. Perhaps I should add it to the gitweb TODO file.

Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and TODO list"
thread?

> Possible complications are the content-type headers, etc, but you could 
> use the -s flag to wget, and store the server headers as well in the 
> file, and get the necessary headers from the file as you stream it.
> 
> i.e. read the headers looking for ones that are "interesting" 
> (Content-Type, charset, expires) until you get a blank line, print out 
> the interesting headers using $cgi->header(), then just dump the 
> remainder of the file to the caller via stdout.

No need for that. $cgi->header() is to _generate_ the headers, so if
a file is saved with headers, we can just dump it to STDOUT; the possible
exception is a need to rewrite 'expires' header, if it is used.
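
A rough sketch of such a dump with the 'expires' rewrite, again in shell
rather than Perl. The name dump_cached is made up, and it assumes the
stored file starts directly with header lines and that the headers end
at the first blank line:

```shell
# Hypothetical sketch: stream a cached response (headers + body) to stdout,
# replacing only the Expires header with a fresh value.
dump_cached() {
    file=$1 new_expires=$2
    awk -v expires="$new_expires" '
        # body lines pass through unchanged
        in_body { print; next }
        # a blank line (optionally with a trailing CR) ends the headers
        /^\r?$/ { in_body = 1; print; next }
        # replace the stored Expires header (emitted with a bare LF)
        tolower($0) ~ /^expires:/ { print "Expires: " expires; next }
        # all other headers are copied verbatim
        { print }
    ' "$file"
}
```

Usage would be something like
`dump_cached "$git_temp/$project/rss" "Sat, 09 Dec 2006 13:38:24 GMT"`.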

Perhaps gitweb should generate its own ETag instead of messing with
'expires' header?
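
One way this could look, purely as a sketch: derive the ETag from an
object id (for example the id of the head commit) and compare it with the
client's If-None-Match value, which a CGI script sees as
$HTTP_IF_NONE_MATCH. The function names are hypothetical, and the
comparison is simplistic (no "*", no lists, no weak validators):

```shell
# Hypothetical sketch of ETag handling in a CGI context.

# Build a quoted ETag from an object id.
etag_for() { printf '"%s"' "$1"; }

# $1 = object id, $2 = value of the If-None-Match request header.
# Returns 0 after printing a complete 304 response (no body needed),
# or 1 after printing only the ETag header, leaving the remaining
# headers and the body to the caller.
conditional_headers() {
    etag=$(etag_for "$1")
    if [ "$2" = "$etag" ]; then
        printf 'Status: 304 Not Modified\r\nETag: %s\r\n\r\n' "$etag"
        return 0
    fi
    printf 'ETag: %s\r\n' "$etag"
    return 1
}
```

With this, an unchanged page costs one id lookup instead of a full page
generation, and there is no expiry time to guess.
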
-- 
Jakub Narebski
Poland