On Sun, 27 Feb 2011, Jakub Narebski wrote:
> On Sun, 27 Feb 2011, Jonathan Nieder wrote:
> > Jakub Narebski wrote:
> >
> > > +# this assumes that forks follow immediately forked projects:
> > > +# [ ..., 'repo.git', 'repo/fork.git', ...]; if not, set second
> > > +# parameter to true value to sort projects by path first.
> > > +sub filter_forks_from_projects_list {
> > > +	my ($projects, $need_sorting) = @_;
> > > +
> > > +	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
> > > +		if ($need_sorting);
> >
> > What happens in this scenario?
> >
> >	git.git
> >	git.related.project.git
> >	git/platforms.git
>
> Thanks for catching this.
>
> It looks like I have oversimplified the algorithm by requiring that
> a forked project directly precedes any of its forks (if they exist).
> I'd have to redo this part of the patch.
>
> > [...]
> > > @@ -4747,23 +4784,36 @@ sub fill_project_list_info {
> > >  		if (!defined $pr->{'owner'}) {
> > >  			$pr->{'owner'} = git_get_project_owner("$pr->{'path'}") || "";
> > >  		}
> > > -		if ($check_forks) {
> > > -			my $pname = $pr->{'path'};
> > > -			if (($pname =~ s/\.git$//) &&
> > > -			    ($pname !~ /\/$/) &&
> > > -			    (-d "$projectroot/$pname")) {
> > > -				$pr->{'forks'} = "-d $projectroot/$pname";

Let's examine what we had, what this commit does, and what we can have.

Before this commit there were two implementations of detecting and
filtering out forks: one based only on directory structure in
fill_project_list_info (it detected forks, but didn't filter them out),
and one in git_get_projects_list, which was based on pathnames and was
run only for $projects_list being [index] file (it filtered out forks).

This commit replaced both by filter_forks_from_projects_list, which
mainly detects forks based on path, with a directory-based check only
for whether there _can_ be forks... but this solution is based on the
overly strict and simply untrue assumption that forks always directly
follow the forked project, which need not hold, e.g. when scanning the
$projects_list directory.

Hmmm... it looks to me like filtering out forks worked only for
$projects_list being a file; the code in fill_project_list_info()
detected which projects were forked, but didn't detect which projects
were forks and should be filtered out.  The code inside the
$check_forks conditional that used to be in git_project_list_body()
was about filtering out, for a second time, projects that are not
forks of the current project, for the 'forks' action.

Now, the naive implementation of detecting forks, and also the
implementation that used to be used in git_get_projects_list() for
$projects_list being a file, is O(n^2) in the number of projects.
With a bit of simplification it can be O(<projects>*<forked projects>)
+ O(<projects>)*stat (cost of "-d <dir>").

With git.kernel.org having around 1,000 projects, and repo.or.cz having
around 4,000 projects not counting forks, O(n^2) is out of the
question.  Of those 4,000 projects around 300 (or 7%) have forks,
according to the repo.or.cz projects list page.  But assuming that the
percentage of forked repositories remains constant as the number of
repositories grows, this is still quadratic cost... though perhaps
acceptable.

We can use a more advanced algorithm to find prefixes (to find forks);
for example, using a trie [1][2] to find forks costs
O(<projects>*<depth>).  For git.kernel.org, which has quite a deep
hierarchy of projects, the number of path components (depth) is no
more than 6.
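To make the idea concrete, here is a minimal sketch (not code from the
patch; the names build_projects_trie and filter_forks_with_trie are
made up for illustration) of such a stripped-down hash-of-hashes trie
keyed on path components, assuming a project is a fork if some ancestor
directory of its path is itself a project (with '.git' stripped):

# A hash-of-hashes trie: path components are keys, and the special key
# '' (never a valid path component) marks that a project ends at that
# node and holds its hashref.
sub build_projects_trie {
	my $projects = shift;
	my %trie;

	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;
		my $node = \%trie;
		foreach my $part (split m!/!, $path) {
			$node = $node->{$part} ||= {};
		}
		$node->{''} = $pr;   # project ends here
	}
	return \%trie;
}

# Keep only projects that are not forks, i.e. projects that do not live
# inside a directory which is itself a project; mark forked projects by
# setting a true 'forks' field.
sub filter_forks_with_trie {
	my ($trie, $projects) = @_;
	my @filtered;

 PROJECT:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;
		my @parts = split m!/!, $path;
		my $node = $trie;
		# walk all proper ancestors of this project's path
		foreach my $i (0 .. $#parts - 1) {
			$node = $node->{$parts[$i]};
			if (exists $node->{''}) {
				# ancestor is itself a project: $pr is its fork
				$node->{''}{'forks'} = 1;
				next PROJECT;
			}
		}
		push @filtered, $pr;
	}
	return @filtered;
}

Because matching is done on whole path components, 'git/platforms.git'
would be recognized as a fork of 'git.git' while
'git.related.project.git' would not, independently of the order of the
projects list; each project is inserted once and each lookup walks at
most <depth> nodes.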
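And a rough sketch of how the two approaches can be compared on a
synthetic project list (again hypothetical code, not the script
actually used: it reuses the subs from the sketch above, names like
synthetic_projects and filter_forks_naive are invented, and the naive
variant only approximates the %paths prefix scan from
git_get_projects_list):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Generate a synthetic list: $n top-level projects, a fraction of them
# forked, each forked project having $forks_per forks.
sub synthetic_projects {
	my ($n, $forked_frac, $forks_per) = @_;
	my @projects;
	foreach my $i (1 .. $n) {
		push @projects, { 'path' => "project$i.git" };
		next if $i > $n * $forked_frac;
		push @projects, { 'path' => "project$i/fork$_.git" }
			foreach 1 .. $forks_per;
	}
	return \@projects;
}

# Roughly the O(n^2) filtering that git_get_projects_list() did for
# $projects_list being a file: for every project scan all previously
# seen project paths, checking whether one of them is a prefix.
sub filter_forks_naive {
	my $projects = shift;
	my (%paths, @filtered);

 PRJ:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		foreach my $seen (keys %paths) {
			next unless substr($path, 0, length($seen)) eq $seen;
			# it is a fork of $seen; skip it
			next PRJ if substr($path, length($seen)) =~ m!^/.*\.git$!;
		}
		push @filtered, $pr;
		(my $forks_path = $path) =~ s/\.git$//;
		$paths{$forks_path}++;
	}
	return @filtered;
}

my $projects = synthetic_projects(2000, 0.05, 5);
# run each variant for at least 3 CPU seconds and print a comparison
cmpthese(-3, {
	'naive' => sub { filter_forks_naive($projects) },
	'trie'  => sub {
		filter_forks_with_trie(build_projects_trie($projects), $projects);
	},
});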
A synthetic benchmark with 2000 repositories, 5% of which are forked,
with 5 forks per forked repository, shows that the trie-based solution
is around 25 times faster than the original solution from
git_get_projects_list with %paths.

The disadvantage of this solution is that even a seriously
stripped-down trie construction and lookup is around 2 pages of
code... or it means using the non-core Tree::Trie module from CPAN.

[1]: http://en.wikipedia.org/wiki/Trie
[2]: http://p3rl.org/Tree::Trie

-- 
Jakub Narebski
Poland