Re: [PATCH 1/6 (v2)] gitweb: Restructure projects list generation

Jakub Narebski <jnareb@xxxxxxxxx> · Sat, 7 May 2011 20:39:22 +0200

On Fri, 29 Apr 2011, Jakub Narebski wrote:

> Extract filtering out forks (which is done if 'forks' feature is
> enabled) into filter_forks_from_projects_list subroutine, and
> searching projects (via projects search form, or via content tags)
> into search_projects_list subroutine.

If it was all this patch did, it would be probably easier to review.
Should I then split this patch into smaller, though not entirely
standalone patches: pure refactoring, improving finding forks, and
adding tests?

> Sorting projects got also refactored in a very straight way (just
> moving code) into sort_projects_list subroutine.

This also could be refactored in a separate patch... though I rather
have all refactorings together, so one can see the result: clean code.

> The interaction between forks, content tags and searching is now made
> more explicit: searching whether by tag, or via search form turns off
> fork filtering (gitweb searches also forks, and will show all
> results).  If 'ctags' feature is disabled, then searching by tag is
> too.

This is functional change that makes code more clear, and I think it
makes better behavior... but it changes how gitweb behaves, so it is
not pure refactoring.  Should I put it into a bit artificial separate
patch?

> While at it we now detect that there are no projects and respond with
> "404 No projects found" also for 'project_index' and 'opml' actions.

This "while at it" is here because we can now say "No projects found"
for 'projects_list' action, because filtering / searching is done
upfront, so we know if we found anything.

But again, this perhaps would be better put in a separate patch.

[...]
> filter_forks_from_projects_list() uses trie[1] (prefix tree) to find
> which repositories are forks of other projects; directories are used
> as "letters" in trie.  Algorithm to create trie and find prefix match
> was created with some help of code of Tree::Tree[2] Perl module from
> CPAN.
> 
> [1]: http://en.wikipedia.org/wiki/Trie
> [2]: http://p3rl.org/Tree::Trie

This change in algorithm should perhaps be put in a separate patch...

> Note that before this commit filtering out forks worked only for
> $projects_list being a file, it assumed that forks are always after
> forked project (not very unreasonable), and used algorithm which is
> O(<projects>^2).
> 
> Current code filters out projects only where it should, regardless
> whether $projects_list is directory or a file, and trie-based
> algorithm is O(<projects>) + O(<projects forked>*<max depth>).
> Current code also consistently finds shortest prefix (important in
> counting forks in case of forks of forks).

Jakub Narebski wrote in
"Re: What's cooking in git.git (May 2011, #02; Wed, 4)"
http://thread.gmane.org/gmane.comp.version-control.git/172789/focus=172820

> Junio C Hamano <gitster@xxxxxxxxx> writes:
[...]
>> * jn/ctags (2011-04-29) 6 commits
>>  - gitweb: Optional grouping of projects by category
>>  - gitweb: Modularized git_get_project_description to be more generic
>>  - gitweb: Split git_project_list_body in two functions
>>  - gitweb: Mark matched 'ctag' / contents tag (?by_tag=foo)
>>  - gitweb: Change the way "content tags" ('ctags') are handled
>>  - gitweb: Restructure projects list generation
>> 
>> Waiting for comments.
> 
> Should I do and post benchmarks for
> 
>    - gitweb: Restructure projects list generation
> 
> change (when 'forks' feature is used)?

It is not easy to compare old and new code, because those algorithm
have different behavior.

Best I can come up with is comparing pure fork detection (not the time
it takes to generate page) - I can easy mock that - for sizes comparable
with http://repo.or.cz which uses 'forks' feature.  repo.or.cz has
around 4,000 projects, around 300 of those have forks (note: forks are
not included in number of projects).  From non-representative sample
most forked project have but single fork; note however that there are
repositories such as git.git, which has 55 forks, and some of those have
forks on its own (forks of forks).

 plan: nprojects=4000; nprojects_forked=300; nforks_per_forked_project=1
 number of projects: 4000

 Benchmark: running old, new for at least 600 CPU seconds...
        old: 1231.38 wallclock secs (620.23 usr +  1.30 sys = 621.53 CPU) @  0.03/s (n=21)
        new:  847.6  wallclock secs (632.34 usr +  0.80 sys = 633.14 CPU) @  5.18/s (n=3277)
     s/iter    old    new
 old   29.6     --   -99%
 new  0.193 15219%     --

 Dumbbench: running old, new for at least 50 iterations...
 old: Ran 106 iterations (25 outliers).
 old: Rounded run time per iteration: 3.316e+01  +/- 1.5e-01 (0.5%)
 new: Ran 66 iterations (15 outliers).
 new: Rounded run time per iteration: 2.0213e-01 +/- 2.1e-04 (0.1%)

Below there is table of time versus number of projects, used to generate
attached and linked images.  Data was generated using Dumbbench module
from CPAN.

 ########################################################
 # nforks  = 5 (forks per project forked)
 # nforked = 5.00%
 # 1:projects 2:nforked  3:old 4:old_err  5:new 5:new_err
 #  0   0  8.959e-06  2.2e-08  1.0137e-05 2.0e-08
   50   3  3.211e-03  1.6e-05  2.0524e-03 5.0e-06
  103   5  1.4003e-02 6.7e-05  4.477e-03  2.2e-05
  204  10  5.647e-02  2.3e-04  9.464e-03  4.6e-05
  402  20  2.1939e-01 3.2e-04  1.9943e-02 9.6e-05
  500  25  3.666e-01  3.6e-03  2.706e-02  2.5e-04
  800  40  9.124e-01  1.7e-03  4.175e-02  2.1e-04
 1000  50  1.575e+00  1.3e-02  5.896e-02  5.3e-04
 1600  80  3.738e+00  1.2e-02  9.347e-02  4.3e-04
 2004 100  6.820e+00  6.7e-02  1.1216e-01 3.6e-04
 3203 160  1.8189e+01 8.5e-02  1.8144e-01 4.5e-04
 4003 200  2.927e+01  2.9e-01  2.2939e-01 5.7e-04

Script used to generate this data is attached.

From those data we can generate a plot to show how time depends on size
of problem (on total number of projects).  Assuming that both "old" ane
"new" are P i.e. are O(n^k), where 'n' is number of projects, this
dependency is best examined on loglog scale.

      y = x^k
  log y = log (x^k) = k * log x
      v = k * u,  where u = log x, v = log y

This plot, together with fit of cubic polynomial and linear function
to the data, is shown in

  https://git.wiki.kernel.org/index.php/File:Benchmark_find_forks-old_vs_new.png

(I'm sorry for abusing git wiki for sharing images...).
-- 
Jakub Narebski
Poland
#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(first);
use POSIX qw(ceil);

use Acme::MetaSyntactic qw(any);
use Benchmark qw(:all :hireswallclock);
use Dumbbench;

use Data::Dumper::Concise; # for debugging

# ----------------------------------------------------------------------

# old: version used to be in git_get_projects_list for $projects_list file
sub filter__git_get_projects_list {
	my ($projects, $need_sorting) = @_;

	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
		if ($need_sorting);

	my (@filtered, %paths);
 PROJECT:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};

		# below is v1.7.5.1:gitweb/gitweb.perl lines 2725 and later
	PATH:
		foreach my $filter (keys %paths) {
			# looking for forks
			my $pfx = substr($path, 0, length($filter));
			if ($pfx ne $filter) {
				next PATH;
			}
			my $sfx = substr($path, length($filter));
			if ($sfx !~ /^\/.*\.git$/) {
				next PATH;
			}
			# is a fork, don't include it in
			# the list
			next PROJECT;
		}

		# [...]
		# below is v1.7.5.1:gitweb/gitweb.perl line 2746 and later
		push @filtered, $pr;
		(my $forks_path = $path) =~ s/\.git$//;
		$paths{$forks_path}++;
	}

	return @filtered;
}

# new: using 'path' only, using trie; code borrowed from Tree::Trie (Perl Artistic License)
sub filter_forks_from_projects_list {
	my ($projects, $need_sorting) = @_;

	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
		if ($need_sorting);

	my %trie; # prefix tree of directories (path components)
	# generate trie out of those directories that might contain forks
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;      # forks of 'repo.git' are in 'repo/' directory
		next if ($path =~ m!/$!); # skip non-bare repositories, e.g. 'repo/.git'
		next unless ($path);      # skip '.git' repository: tests, git-instaweb
		#next unless (-d $path);  # containing directory exists (NOT IN TESTS!!!)
		$pr->{'forks'} = [];      # there can be 0 or more forks of project

		# add to trie
		my @dirs = split('/', $path);
		# walk the trie, until either runs out of components or out of trie
		my $ref = \%trie;
		while (scalar @dirs &&
		       exists($ref->{$dirs[0]})) {
			$ref = $ref->{shift @dirs};
		}
		# create rest of trie structure from rest of components
		foreach my $dir (@dirs) {
			$ref = $ref->{$dir} = {};
		}
		# create end marker, store $pr as a data
		$ref->{''} = $pr if (!exists $ref->{''});
	}

	# filter out forks, by finding shortest prefix match for paths
	my @filtered;
 PROJECT:
	foreach my $pr (@$projects) {
		# trie lookup
		my $ref = \%trie;
	DIR:
		foreach my $dir (split('/', $pr->{'path'})) {
			if (exists $ref->{''}) {
				# found [shortest] prefix, is a fork - skip it
				push @{$ref->{''}{'forks'}}, $pr;
				next PROJECT;
			}
			if (!exists $ref->{$dir}) {
				# not in trie, cannot have prefix, not a fork
				push @filtered, $pr;
				next PROJECT;
			}
			# If the dir is there, we just walk one step down the trie.
			$ref = $ref->{$dir};
		}
		# we ran out of trie
		# (shouldn't happen: it's either no match, or end marker)
		push @filtered, $pr;
	}

	return @filtered;
}

# ----------------------------------------------------------------------

sub prepare_projects {
	my ($nprojects, $nforked, $nforks_per_project, $exclude_forks) = @_;
	my @projects;

	while ($nprojects > 0) {
		push @projects, metaname().".git";
		$nprojects--;
		last unless $nprojects;

		if (rand() < (1.0*$nforked/$nprojects)) {
			(my $base = $projects[-1]) =~ s/\.git$//;
			push @projects, map { "$base/$_.git" } metaname($nforks_per_project);
			$nprojects -= $nforks_per_project unless $exclude_forks;
			$nforked--;
		}
	}

	#print Dumper(\@projects);
	@projects = map { { 'path' => $_ } } @projects;

	return @projects;
}

# ######################################################################

my (@projects, @filtered);

# repo.or.cz
# * 4038 projects (without forks?), based on http://repo.or.cz/w?a=project_index
# * 4024 projects, based on "grep -c -e '<tr class="'" on http://repo.or.cz/w?a=project_list
# * 3428 active projects, based on "grep -c -e '"age'" on http://repo.or.cz/w?a=project_list
# *  308 projects have forks (7.6-10%), based on grep -c -e '/forks">+</a>' project_list.html
# * nonrepresentative sample of nforks: 6, 2, 1, 1, 0, 0, 1, 0, 1, 2, 3, 3, 3, 55+, 1, 1, 1

my $nprojects = 100;
my $nforked_frac = 0.05;
my $nforked   = ceil($nforked_frac*$nprojects); # 5%
my $nforks    = 5;

$nprojects = 4000;
$nforked   = 300;
$nforks    = 1;

print "plan: nprojects=$nprojects; nforked=$nforked; nforks=$nforks\n";
@projects = prepare_projects($nprojects, $nforked, $nforks);
print "number of projects: ".(scalar @projects)."\n";

print "\n";
my $results = timethese(-600, {
	'old' => sub { filter__git_get_projects_list(\@projects) },
	'new' => sub { filter_forks_from_projects_list(\@projects) },
});
#print Dumper($results);
cmpthese($results);
print "\n";

print "Dumbbench: old, new\n";
my $bench = Dumbbench->new(
	target_rel_precision => 0.005, # seek ~0.5%
	initial_runs         => 50,    # the higher the more reliable
);
$bench->add_instances(
	Dumbbench::Instance::PerlSub->new(
		name => 'old',
		code => sub { filter__git_get_projects_list(\@projects) },
	),
	Dumbbench::Instance::PerlSub->new(
		name => 'new',
		code => sub { filter_forks_from_projects_list(\@projects) },
	),
);
$bench->run();
$bench->report();

print "\n";
#print Dumper($bench->instances());
__END__

print "#####################################################################\n".
      "# nforks  = $nforks (forks per project forked)\n".
      "# nforked = ".(sprintf "%.2f%%", $nforked_frac*100.0)."\n".
      "# 1:projects 2:nforked  3:old 4:old_err  5:new 5:new_err\n";
for (my $i = 50; $i <= 4000; $i *= 2) {
	$nforked  = ceil($nforked_frac*$i);
	@projects = prepare_projects($i, $nforked, $nforks);

	$bench = Dumbbench->new(
		target_rel_precision => 0.005, # seek ~0.5%
		initial_runs         => 50,    # the higher the more reliable
	);
	$bench->add_instances(
		Dumbbench::Instance::PerlSub->new(
			name => 'old',
			code => sub { filter__git_get_projects_list(\@projects) },
		),
		Dumbbench::Instance::PerlSub->new(
			name => 'new',
			code => sub { filter_forks_from_projects_list(\@projects) },
		),
	);
	$bench->run();

	my %instances = map { $_->name() => $_ } $bench->instances();

	printf "%4d %3d ", (scalar @projects), $nforked;
	foreach my $name (qw(old new)) {
		my $result = $instances{$name}->result();
		print " ".$result->number();
		print " ".$result->error()->[0];
	}
	print "\n";
}
#print Dumper($bench->instances());
#print Dumper($_->result()) foreach ($bench->instances());

print "\n";

1;
__END__