Re: [PATCH 00/18] Gitweb caching v8

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 10 Dec 2010 02:27:27 +0100

On Fri, 10 Dec 2010, J.H. wrote:
> On 12/09/2010 03:26 PM, Jakub Narebski wrote:

>> John, could you please in the future Cc me?  I am interested in gitweb
>> output caching development.  Thanks in advance.
> 
> Apologies, apparently screwed up on my git send-email line.  I'll get
> that right one of these eons.

Ah, I can understand this.

>> I guess that in the final submission (i.e. the one that is to be
>> merged in into git.git repository) those changes would be squashed in,
>> isn't it?
> 
> I have no objections to squashing the reversions into a single patch,
> just figured it was easier to break them out for the time being.

I guess that interdiff in comments would work as well, or almost as well...

>>> There is one additional script I've written that the Fedora folks are using,
>>> and that might be useful to include, which is an 'offline' cache file generator.
>>> It basically wraps gitweb.cgi and at the end moves the cache file into the right
>>> place.  The Fedora folks were finding it took hours to generate their front
>>> page, and that doing a background generation almost never completed (due to 
>>> process death).  This was a simple way to handle that.  If people would like
>>> I can add it in as an additional patch.
>> 
>> Are you detaching the background process?

Errr... what I meant here is that perhaps detaching background process
would make it not die, but I am guessing here.

> No, in fact I completely turn off forking (using the $cacheDoFork variable.)

BTW. what I don't like is your code forking indiscriminately even if it
is not needed (e.g. background cache generation is turned off).

> 
>> It would be nice to have it as separate patch.
> 
> I can add it easily enough.

It is only about caching most IO intensive page, i.e. projects_list page,
isn't it?  Why doesn't _it_ die, like background process?

> 
>>> v8:
>>> 	- Reverting several changes from Jakub's change set that make no sense
>>>                 - is_cacheable changed to always return true - nothing special about
>>>                   blame or blame_incremental as far as the caching engine is concerned
>> 
>> 'blame_incremental' is just another version of 'blame' view.  I have
>> disabled it when caching is enabled in my rewrite (you instead disabled
>> caching for 'blame_incremental' in your v7 and mine v7.x) because I
>> couldn't get it to work together with caching.  Did you check that it
>> works?
> 
> blame works fine, blame_incremental generates but doesn't..... ohhhh
> someone added ajaxy kinda stuff and doesn't mention it anywhere.

Errr... I thought that the 'incremental' part is self-explaining that
it is Ajax-y stuff.  Well, while commit is 4af819d (gitweb: Incremental
blame (using JavaScript), 2009-09-01), perhaps I should have added some
comment in the code.

> 
> Exciting.
> 
> blame_data needs to not get a 'generating...' page in all likelihood,
> generating a blame_incremental page, letting it load and then refreshing
> the whole thing gets me what I'm expecting.

Hmmm... I wonder why it didn't work for me at that time...

> 
> Is enough to mask.
> 
> Guess I'm looking at a v9 now.
> 
>> Besides, withou "tee"-ing, i.e. printing output as it is captured,
>> cached 'blame_data' means that 'blame_incremental' is not incremental,
>> and therefore it vanishes its advantage over 'blame'.

I mean here that with current state of caching 'blame_incremental' stops
to be incremental...

> There are only 2 ways to get to a blame_incremental page
> 
> 1) By going to a blame page and clicking on the incremental link in the nav
> 
> 2) By enabling it by default so when you click 'blame' it goes to
> incremental first.

  3) By having JavaScript add ';js=1' to all links, so clicking on
  'blame' link (with action set to 'blame') would result in 
  'blame_incremental' view.

> 
>> In the case data is in cache, then 'blame_inremental' doesn't have
>> advantage over 'blame' either.
> 
> Agreed, though it's easy enough to support in the caching engine,
> basically don't return 'Generating...' and wait for that data to cache.
> Not really an advantage except that your not waiting for the whole
> generation to get a page back at all.
> 
>>>                 - Reverted change to reset_output as
>>>                         open STDOUT, ">&", \*STDOUT_REAL;
>>>                   causes assertion failures:
>>>                   Assertion !((((s->var)->sv_flags & (0x00004000|0x00008000)) == 0x00008000) && (((svtype)((s->var)->sv_flags & 0xff)) == SVt_PVGV || ((svtype)((s->var)->sv_flags & 0xff)) == SVt_PVLV)) failed: file "scalar.xs", line 49 at gitweb.cgi line 1221.
>>>                   if we encounter an error *BEFORE* we've ever changed the output.
>> 
>> Which Perl version are you using?  Because I think you found error in Perl.
>> Well, at least I have not happen on this bug.
> 
> This is perl, v5.10.0 built for x86_64-linux-thread-multi

Could you check with newer perl?  I don't get this error.

>> I have nothing againts using
>> 
>>   open STDOUT, ">&STDOUT_REAL";
>> 
>> though I really prefer that you used lexical filehandles, instead of
>> "globs" which are global variables.

And using 'print STDOUT_REAL "";' protects against spurious warning
(the warning is really wrong in this case).

>> The following works:
>> 
>>   open STDOUT, '>&', fileno($fh);
>> 
>> Note that fileno(Symbol::qualify_to_ref($fh)) might be needed...
> 
> I see 0 advantage to shifting around STDOUT and STDERR to a lexical
> filehandle vs. a glob in this case.  STDOUT_REAL retains all the
> properties of STDOUT should it be needed elsewhere, including what it
> was going and what it was doing.
> 
> I have no objection to shifting the file handles I'm using to lexical
> variables, if nothing else the argument about them closing when falling
> out of scope is worth it, but for STDOUT, STDERR, etc I don't think
> switching to lexicals makes a lot of sense

Well... I'd have to agree that in current case (capturing engine embedded
in gitweb, and gitweb-specific; no need for recursive capture) it would
be enough to use such globs.

> 
>>>         - Cleanups there were indirectly mentioned by Jakub
>>>                 - Elimination of anything even remotely looking like duplicate code
>>>                         - Creation of isBinaryAction() and isFeedAction()
>> 
>> Could you please do not use mixedCase names?
> 
> I'm fine with renaming those if you wish.
> 
>> First, that is what %actions_info from
>> 
>>   [PATCH 16/24] gitweb: Introduce %actions_info, gathering information about actions
>>   http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163038
>>   http://repo.or.cz/w/git/jnareb-git.git/commitdiff/305a10339b33d56b4a50708d71e8f42453c8cb1f
>> 
>> I have invented for.
> 
> I have not based any of my caching engine, right now, on anything you've
> done for your rewrite.

What I meant here that if you will be doing yet another version, you
can take a look at it as a way to avoiding not very clear and nice
long alternatives in condition, or in regexp matched.

> 
>> Second, why 'isBinaryAction()'?  there isn't something inherently
>> different between binary (':raw') and text (':utf8') output, as I have
>> repeatedly said before.
> 
> It's a binary action in that you are shoving something down the pipe
> with the intention of sending the bits completely raw.  You read the
> data raw, and write the data raw.  There is no interpretation of the
> data as being anything but straight raw.
> 
> Right now, in gitweb already, there are two places that treat output
> completely differently:
> 
> 	- snapshot
> 	- blob_plain
> 
> The only reason isBinaryAction() (or any other function name or process
> you want to grant it) exists is so that I can figure out if it's one of
> those actions so I can deal with the cache and output handling
> differently for each.
> 
> Yes, I could flip the entire caching engine over to following the same
> mantra for everything and thus there is no need to care, but gitweb
> itself isn't really setup to handle that separation cleanly right now,
> and I'm trying to make as few bigger changes right now as is.

Always reading from cache in ':raw' mode and always printing from cache
in ':raw' mode (i.e. setting STDOUT to ':raw' before printing / copying
cache entry) would be in gitweb case enough to not special-case binary
files.

In gitweb you always do "binmode STDOUT, ':raw';" _after_ starting capture,
which means that it gets applied to cache file; and gitweb always do
"binmode STDOUT, ':utf8';" before stopping capture.

If you print text data to file using ':utf8' layer (applied at beginning
to cache file) it is in this file as correct sequence of bytes.  Therefore
you can dump said cache file to STDOUT in ':raw' mode (or in ':utf8' mode)
- both STDOUT and read cache file has to have the same mode.

>>>         - Added better error handling
>>>                 - Creation of .err file in the cache directory
>>>                 - Trap STDERR output into $output_err as this was spewing data prior
>>>                   to any header information being sent
>> 
>> Why it is needed?  We capture output of "die" via CGI::Util::set_message,
>> and "warn" output is captured to web server logs... unless you explicitely
>> use "print STDERR <sth>" -- don't do that instead.
> 
> I have seen, in several instances, a case where git itself will generate
> an error, it shoves it to STDERR which makes it to the client before
> anything else, thus causing 500 level errors.
> 
> Added this so that STDERR got trapped and those messages didn't make it out.

Could you give examples when it happens?  Anything that happens after
"use CGI::Carp" is parsed should have STDERR redirected to web server
errors log.

I'll read the actual patch and comment on it.

> 
>>>         - Added hidden field in footer for url & hash of url, which is extremely useful
>>>           for debugging
>> 
>> Nice idea, I'll see it.  Can it be disabled (information leakage)?
> 
> There's not really any information leakage per-se, unless you call
> md5suming the url information leakage.

Ah, sorry, I send this comment before actually reading patch in question.

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html