On Sun, 27 Feb 2011, Jakub Narebski wrote:
> On Sun, 27 Feb 2011, Jonathan Nieder wrote:
> > Jakub Narebski wrote:
> >
> > > +# this assumes that forks follow immediately forked projects:
> > > +# [ ..., 'repo.git', 'repo/fork.git', ...]; if not, set second
> > > +# parameter to true value to sort projects by path first.
> > > +sub filter_forks_from_projects_list {
> > > +	my ($projects, $need_sorting) = @_;
> > > +
> > > +	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
> > > +		if ($need_sorting);
> >
> > What happens in this scenario?
> >
> >	git.git
> >	git.related.project.git
> >	git/platforms.git
>
> Thanks for catching this.
>
> It looks like I have oversimplified the algorithm by requiring that
> a forked project directly precedes any of its forks (if they exist).
> I'd have to redo this part of the patch.
>
> > [...]
> > > @@ -4747,23 +4784,36 @@ sub fill_project_list_info {
> > >  		if (!defined $pr->{'owner'}) {
> > >  			$pr->{'owner'} = git_get_project_owner("$pr->{'path'}") || "";
> > >  		}
> > > -		if ($check_forks) {
> > > -			my $pname = $pr->{'path'};
> > > -			if (($pname =~ s/\.git$//) &&
> > > -			    ($pname !~ /\/$/) &&
> > > -			    (-d "$projectroot/$pname")) {
> > > -				$pr->{'forks'} = "-d $projectroot/$pname";

Let's examine what we had, what this commit does, and what we can have.

Before this commit there were two implementations of detecting and
filtering out forks: one based only on directory structure in
fill_project_list_info (it detected forks, but didn't filter them out),
and one in git_get_projects_list, which was based on pathnames and was
run only for $projects_list being [index] file (it filtered out forks).

This commit replaced both by filter_forks_from_projects_list, which
mainly detects forks based on path, with a directory-based check only
for whether there _can_ be forks... but this solution is based on the
overly strict and simply untrue assumption that forks always directly
follow the forked project, which need not hold, e.g. when scanning the
$projects_list directory.

Hmmm... it looks to me like filtering out forks worked only for
$projects_list being a file; the code in fill_project_list_info()
detected which projects were forked, but didn't detect which projects
were forks and should be filtered out.  The code inside the
$check_forks conditional that used to be in git_project_list_body()
was about filtering out, for a second time, projects that are not
forks of the current project, for the 'forks' action.

Now, the naive implementation of detecting forks, and also the
implementation that used to be used in git_get_projects_list() for
$projects_list being a file, is O(n^2) in the number of projects.
With a bit of simplification it can be O(<projects>*<forked projects>)
+ O(<projects>)*stat (cost of "-d <dir>").

With git.kernel.org having around 1,000 projects, and repo.or.cz having
around 4,000 projects not counting forks, O(n^2) is out of the
question.  Of those 4,000 projects around 300 (or 7%) have forks,
according to the repo.or.cz projects list page.  But assuming that the
percentage of forked repositories remains constant as the number of
repositories grows, this is still quadratic cost... though perhaps
acceptable.

We can use a more advanced algorithm to find prefixes (to find forks);
for example, using a trie [1][2] to find forks costs
O(<projects>*<depth>).  For git.kernel.org, which has quite a deep
hierarchy of projects, the number of path components (depth) is no
more than 6.
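To make the idea concrete, here is a minimal sketch (not code from the
patch; the names build_projects_trie and filter_forks_with_trie are
made up for illustration) of such a stripped-down hash-of-hashes trie
keyed on path components, assuming a project is a fork if some ancestor
directory of its path is itself a project (with '.git' stripped):

# A hash-of-hashes trie: path components are keys, and the special key
# '' (never a valid path component) marks that a project ends at that
# node and holds its hashref.
sub build_projects_trie {
	my $projects = shift;
	my %trie;

	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;
		my $node = \%trie;
		foreach my $part (split m!/!, $path) {
			$node = $node->{$part} ||= {};
		}
		$node->{''} = $pr;   # project ends here
	}
	return \%trie;
}

# Keep only projects that are not forks, i.e. projects that do not live
# inside a directory which is itself a project; mark forked projects by
# setting a true 'forks' field.
sub filter_forks_with_trie {
	my ($trie, $projects) = @_;
	my @filtered;

 PROJECT:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;
		my @parts = split m!/!, $path;
		my $node = $trie;
		# walk all proper ancestors of this project's path
		foreach my $i (0 .. $#parts - 1) {
			$node = $node->{$parts[$i]};
			if (exists $node->{''}) {
				# ancestor is itself a project: $pr is its fork
				$node->{''}{'forks'} = 1;
				next PROJECT;
			}
		}
		push @filtered, $pr;
	}
	return @filtered;
}

Because matching is done on whole path components, 'git/platforms.git'
would be recognized as a fork of 'git.git' while
'git.related.project.git' would not, independently of the order of the
projects list; each project is inserted once and each lookup walks at
most <depth> nodes.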
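And a rough sketch of how the two approaches can be compared on a
synthetic project list (again hypothetical code, not the script
actually used: it reuses the subs from the sketch above, names like
synthetic_projects and filter_forks_naive are invented, and the naive
variant only approximates the %paths prefix scan from
git_get_projects_list):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Generate a synthetic list: $n top-level projects, a fraction of them
# forked, each forked project having $forks_per forks.
sub synthetic_projects {
	my ($n, $forked_frac, $forks_per) = @_;
	my @projects;
	foreach my $i (1 .. $n) {
		push @projects, { 'path' => "project$i.git" };
		next if $i > $n * $forked_frac;
		push @projects, { 'path' => "project$i/fork$_.git" }
			foreach 1 .. $forks_per;
	}
	return \@projects;
}

# Roughly the O(n^2) filtering that git_get_projects_list() did for
# $projects_list being a file: for every project scan all previously
# seen project paths, checking whether one of them is a prefix.
sub filter_forks_naive {
	my $projects = shift;
	my (%paths, @filtered);

 PRJ:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		foreach my $seen (keys %paths) {
			next unless substr($path, 0, length($seen)) eq $seen;
			# it is a fork of $seen; skip it
			next PRJ if substr($path, length($seen)) =~ m!^/.*\.git$!;
		}
		push @filtered, $pr;
		(my $forks_path = $path) =~ s/\.git$//;
		$paths{$forks_path}++;
	}
	return @filtered;
}

my $projects = synthetic_projects(2000, 0.05, 5);
# run each variant for at least 3 CPU seconds and print a comparison
cmpthese(-3, {
	'naive' => sub { filter_forks_naive($projects) },
	'trie'  => sub {
		filter_forks_with_trie(build_projects_trie($projects), $projects);
	},
});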
A synthetic benchmark with 2000 repositories, 5% of which are forked,
with 5 forks per forked repository, shows that the trie-based solution
is around 25 times faster than the original solution from
git_get_projects_list with %paths.

The disadvantage of this solution is that even a seriously
stripped-down trie construction and lookup is around 2 pages of
code... or it means using the non-core Tree::Trie module from CPAN.

[1]: http://en.wikipedia.org/wiki/Trie
[2]: http://p3rl.org/Tree::Trie

-- 
Jakub Narebski
Poland