Re: [RFC/PATCH] gitweb: Paginate project list

On Monday, 12 May 2008 at 17:43, Lars Hjemli wrote:
> On 5/12/08, Jakub Narebski <jnareb@xxxxxxxxx> wrote:
>> On Sunday, 11 May 2008 at 08:56, Lars Hjemli wrote:
>>
>>> It seems to me that "projectlist in a single file" and "cache results
>>> of filled in @$projlist" are different solutions to the same problem:
>>> rapidly filling a Perl data structure.
>>
>> Well, yes and no.  "Projectlist in a single file" is about _static_
>> data (which changes only when projects are added or deleted, or their
>> descriptions change; those are usually rare events), and about
>> avoiding mainly I/O rather than CPU (scanning the filesystem for
>> repositories, reading config and description files, etc.).
>>
>> "Cache data" is about caching _variable_ data, such as the "Last
>> changed" information for a project.  Caching data instead of caching
>> output (caching HTML) allows the cache to be shared between different
>> presentations of the very same data (e.g. 'history'/'shortlog' vs
>> 'rss').  And for some pages, like project search results, caching
>> HTML output doesn't make much sense, while caching data does.
> 
> While I agree that caching search result output almost never makes
> sense, I think it's more important that cache hits require minimal
> processing.  This is why I've chosen to cache the final result instead
> of an intermediate state, but both solutions obviously have their pros
> and cons.

True.  In most cases caching the final output is enough; only in some
cases is caching data the better solution.  I hope that the "Gitweb
caching" project in Git's Google Summer of Code 2008 will examine this
in more detail.

But please take into account that gitweb performance, and I guess the
performance of any git web interface, is I/O bound rather than CPU
bound (at least according to what I remember from J.H.'s emails).  So a
little more processing is, I think, less important than avoiding
hitting the repositories.


From what I remember, J.H.'s gitweb (kernel.org) does adaptive caching
of HTML output, while Pasky's gitweb (repo.or.cz) does data caching,
and only for the projects list page.
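
To illustrate the data caching approach, such a cache for the projects
list could look roughly like this (just a sketch of the idea, not what
repo.or.cz actually runs; the cache path, expiry time and
scan_projects() are made up):

  use Storable qw(store retrieve);

  my $cache_file = '/var/cache/gitweb/projlist.storable';  # made-up path
  my $expire     = 300;  # seconds before the cached list goes stale

  sub get_projects_list_cached {
      # reuse the serialized Perl data structure if it is fresh enough
      if (-f $cache_file && time() - (stat(_))[9] < $expire) {
          return retrieve($cache_file);
      }
      # otherwise redo the expensive, I/O bound scan and save the result
      my @projects = scan_projects();  # stand-in for gitweb's scanning code
      store(\@projects, $cache_file);
      return \@projects;
  }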

>>> This used to be expensive in terms of cache size (similar to
>>> k.org's 20G), but current cgit solves this by treating the cache as
>>> a hash table: cgitrc has an option to set the cache size (number of
>>> files), each filename is generated as `hash(url) % cachesize`, and
>>> each file contains the full URL (to detect hash collisions) followed
>>> by the cached content for that URL (see
>>> http://hjemli.net/git/cgit/tree/cache.c for the details).
>>
>>
>> I guess that is the simplest solution, but I don't think it is the
>> best way to implement a size-limited cache.  For example, the CPAN
>> Perl module Cache::SizeAwareCache and its derivatives use the
>> following algorithm:
>>
>>   The default cache size limiting algorithm works by removing cache
>>   objects in the following order until the desired limit is reached:
>>
>>     1) objects that have expired
>>     2) objects that are least recently accessed
>>     3) objects that expire next
> 
> Again, minimal processing is the goal of cgit's cache implementation,
> hence the simple solution.

I would really like it if someone with a comp-sci background could
calculate the amortized cost of this solution and, what I think is more
important, the cost of the worst case and the probability of hitting
the worst case or something close to it.
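
To make such an analysis concrete, here is my reading of the scheme
described above, transcribed into Perl (cache.c is of course C; md5 is
only a stand-in for cgit's hash function, and expiry checking is left
out):

  use Digest::MD5 qw(md5_hex);

  my $cache_dir  = '/var/cache/cgit';  # made-up path
  my $cache_size = 1021;               # number of slots, as set in cgitrc

  sub cache_lookup {
      my ($url, $render) = @_;  # $render regenerates the page on a miss
      my $slot = hex(substr(md5_hex($url), 0, 8)) % $cache_size;
      my $file = "$cache_dir/$slot";

      if (open my $fh, '<', $file) {
          # the first line holds the full URL, to detect hash collisions
          chomp(my $stored_url = <$fh>);
          if ($stored_url eq $url) {
              local $/;             # slurp the rest of the file
              return scalar <$fh>;  # hit: return the cached content
          }
      }
      # miss (or collision): regenerate and overwrite the slot
      my $content = $render->($url);
      open my $out, '>', $file or die "cannot write $file: $!";
      print $out "$url\n$content";
      return $content;
  }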

By the way, when comparing performance you have to take into account
the time it takes to calculate the hash.  Note that for an LRU cache
you can use a heap / priority queue, or a splay tree / self-organizing
binary tree.
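
Just to sketch it, the Cache::SizeAwareCache removal order quoted above
can be expressed as a single composite sort (the 'expires' and
'accessed' fields are illustrative, not the module's internals); a heap
keyed the same way would avoid re-sorting on every trim:

  sub trim_cache {
      my ($cache, $limit) = @_;  # $cache: { key => { expires, accessed } }
      my $now = time();

      my @victims = sort {
          # 1) entries that have already expired go first...
          ($cache->{$b}{expires} <= $now) <=> ($cache->{$a}{expires} <= $now)
          # 2) ...then the least recently accessed...
          || $cache->{$a}{accessed} <=> $cache->{$b}{accessed}
          # 3) ...then whatever expires next
          || $cache->{$a}{expires} <=> $cache->{$b}{expires}
      } keys %$cache;

      delete $cache->{ shift @victims } while keys(%$cache) > $limit;
  }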

>>> Btw: gitweb and cgit seem to be acquiring the same features these
>>> days: cgit recently got pagination + search on the project list.
>>
>> I haven't checked what features cgit has lately...
>>
>> Gitweb development seems a bit stalled; I got no response to the
>> latest turn of the gitweb TODO and wishlist...

...so you would have to turn, for example, to git-php, gitorious and
github for inspiration.

> Well, I for one found the wishlist interesting; I've been pondering
> implementing a graphic log in cgit (inspired by git-forest and
> git-graph), but I refuse to perform a topo-sort ;-)
> 
> Hopefully I can exploit the fact that cgit never uses more than one
> commit as the starting point for log traversal, combined with
> heuristics on commit date, to enable a fast graphic log that will be
> correct for all but the most pathological cases.

I think that if you wait for the graphing API to make it into a
released Git version, you (well, cgit) will be able to use it for
a "fast graphic log".
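
For what it's worth, the raw input for such a commit-date heuristic is
cheap to obtain; a sketch (lane assignment, the hard part, is left
out):

  # commits with their parents, ordered by commit date, starting from
  # the single commit cgit uses as traversal root
  open my $revs, '-|', qw(git rev-list --date-order --parents HEAD)
      or die "cannot run git rev-list: $!";
  while (my $line = <$revs>) {
      my ($commit, @parents) = split ' ', $line;
      # ... assign lanes and draw edges from $commit to @parents ...
  }
  close $revs;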

-- 
Jakub Narebski
Poland
