Re: [PATCH] gitweb: use Git.pm, and use its parse_rev method for git_get_head_hash

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 2 Jun 2008 23:32:27 +0200

On Mon, 2 June 2008, Petr Baudis wrote:
> On Mon, Jun 02, 2008 at 12:19:23AM +0200, Jakub Narebski wrote:
> > On Sat, 31 May 2008, Lea Wiemann wrote:
> > >
> > > We may need to have an explicitly implemented layer 0 (front-end 
> > > caching) in Gitweb for at least a subset of pages (like project pages), 
> > > but we'll see if that's necessary.
> > 
> > I think that front-end caching (HTML/RSS/Atom output caching) has sense
> > for large web pages (large size and large number of items), such as
> > projects list page if it is unpaginated (and perhaps even if it is
> > divided into pages, although I'm sure not for project search page),
> > commonly requested snapshots (if they are enabled), large trees like
> > SPECS directory with all the package *.spec files for distribution
> > repository, perhaps summary/feed for most popular projects. If (when)
> > syntax highlighting would got implemented, probably also caching
> > blob view (as CPU can become issue).
> > 
> > Front-end (HTML output) caching has the advantages of easy to calculate
> > strong ETag, and web server support for If-Match: and If-None-Match:
> > HTTP/1.1 headers.  You can easy support Range: request, needed for
> > resuming download (e.g. for downloading snapshots, if this feature is
> > enabled in gitweb).
> 
> Caching snapshots would definitely make sense, sure.

That reminds me to finally implement nicer (git-describe like)
[proposed] snapshot filenames.  For example for snapshot of state
given by some tag (snapshot of tagged release [1]), don't use
generic
  <project basename>-<40-chars sha1>.<suffix>
  git-b2a42f55bc419352b848751b0763b0a2d1198479.tar.gz
but
  <project basename>-<tag name>.<suffix>
  git-v1.5.5.3.tar.gz
(well, currently tags don't have 'snapshot' link, but this is easily
fixed).  What do you think about (ab)using 'fp' (file_parent)
parameter to pass proposed snapshot file name?

[1] Would it be good feature to add support for limiting snapshots
to snapshots only of tagged releases (which would be I guess more
important when gitweb caching gets implemented).

> > You can even compress the output, and serve it to clients which
> > support proper transparent compression (Content-Encoding).
> 
> What does this have to do with caching?

Well, only in a sense that with front-end caching (to choose if CPU
matters most) this can be done "for free", without incurring extra
CPU, at the cost of little more disk space.

Of course, if most clients understand (accept) Content-Encoding
(transfer encoding), you can store compressed output, with a little
CPU cost to decompress for non-conformant clients; this way frontend
caching can have cache size comparable to [parsed] data caching.

> > And of course it has the advantage of actually been written and tested
> > in work, in the case of kernel.org gitweb.  Although caching parsed
> > data was implemented in repo.or.cz gitweb, it was done only for
> > projects list view, and it is quite new and not so thoroughly tested
> >   http://article.gmane.org/gmane.comp.version-control.git/77469
> 
> This argument does have some value, but I don't think it matters too
> much, since as far as I understood, it is going to get largely
> reimplemented anyway.

What I'd like to see in a bit of time is some estimate how much time
would take implementing data caching almost from scratch (a bit of
code in repo's gitweb), compared to merging in kernel.org's gitweb
caching code...

> > It would be nice for front-end caching to have an option to use absolute
> > time for all time/dates, and to (optionally) not use adaptive
> > Content-Type...
> 
> I'd hate to have to do unnecessary compromises in order to get sensible
> caching.
> 
> Even in your excellent series on Gitweb caching series, I didn't spot
> any arguments that would put frontend caching in front of the
> intermediate data caching option; yes, it is the simplest solution
> implementation-wise, but also the least flexible one. My gut feeling is
> still to go with data caching instead of HTML caching.

As Lars Hjemli wrote in "[RFC/PATCH] gitweb: Paginate project list"
thread (unfortunately not all articles got to git mailing list)

  http://thread.gmane.org/gmane.comp.version-control.git/81838/focus=81875

  <quote>
  In cgit I've chosen "projectlist in a single file" and "cache html
  output". This makes it cheap (in terms of cpu and io) to both generate
  and serve the cached page (and the cache works for all pages).
  </quote>

  <quote>
  While I agree that caching search result output almost never makes
  sense, I think it's more important that cache hits requires minimal
  processing. This is why I've chosen to cache the final result instead
  of an intermediate state, but both solutions obviously got some pros
  and cons.
  </quote>

caching final output is important if you want to minimize processing
(CPU time).  I'd say also if you want to implement Range: for resumable
downloads (snapshots), because otherwise I think it would be quote hard
to do reasonably (with caching only data).

So I guess best solution would be mixed one: use output cache for large
or CPU intensive pages, use data caching to limit cache size and for
maximum flexibility (relative dates, sorting by columns: athough that
would be best solved using some DHTML/JavaScript, paginated output,
projects search etc.)

By the way, we have your (Petr 'Pasky' Baudis, based on repo.or.cz)
and John 'Warthog9' Hawley (based on kernel.org) statements that gitweb
performance is I/O bound, but I don't remember any hard data.

I have said wrongly that one can use 'fio' tool to check I/O
performance; it is not true, this tool can be used to test _filesystem_
by generating specified pattern of I/O load.  I don't know of any tool
which allow to measure if I/O is bottleneck for given application;
'iogrind' measures cold cache start, and it requires I think compiled
program, as it uses Valgrind.  You can measure CPU load, time to
response and memory usage using 'time' (running gitweb as script),
ApacheBench and top; you can measure latency using LatencyTOP.  

Is there some iotop tool, and can it be used to measure performance
bottlenecks of web scripts (web applications)?

-- 
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html