"Georgios Kontaxis via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: [note to other reviewers. input from those who are more familiar with gitweb and Perl is very much appreciated on this patch]. > From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> > > Gitweb extracts content from the Git log and makes it accessible > over HTTP. As a result, e-mail addresses found in commits are > exposed to web crawlers and they may not respect robots.txt. > This may result in unsolicited messages. "... are exposed to web crawlers, which spammers may use." would be sufficient as a problem description. After giving a problem description, it is customery to describe the solution as if you are ordering the codebase to "be like so", so instead of this ... > This is a feature for redacting e-mail addresses > from the generated HTML, etc. content. ... we may say something like Introduce an 'email-privacy' feature, which defaults to false, that redacts e-mail addresses that appear as author/committer info and in log messages from the generated HTML content. > This feature does not prevent someone from downloading the > unredacted commit log, e.g., by cloning the repository, and > extracting information from it. > It aims to hinder the low-effort bulk collection of e-mail > addresses by web crawlers. And this is a good thing to add. Overall, nicely written. > Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> > --- > gitweb: redacted e-mail addresses feature. > > Gitweb extracts content from the Git log and makes it accessible over > HTTP. As a result, e-mail addresses found in commits are exposed to web > crawlers and they may not respect robots.txt. This may result in > unsolicited messages. This is a feature for redacting e-mail addresses > from the generated HTML, etc. content. > > This feature does not prevent someone from downloading the unredacted > commit log, e.g., by cloning the repository, and extracting information > from it. It aims to hinder the low-effort bulk collection of e-mail > addresses by web crawlers. You do not need to repeat the above, which is in the log message above. > Changes since v1: > > * Turned off the feature by default. > * Removed duplicate code. > * Added note about Gitweb consumers receiving redacted logs. > > Changes since v2: > > * The feature can be set on a per-project basis. ('override' => 1) > > Changes since v3: > > * Renamed feature to "email-privacy" and improved documentation. > * Removed UI elements for git-format-patch since it won't be redacted. > * Simplified calls to the address redaction logic. > * Mail::Address is now used to reduce false-positive redactions. Having these under the --- line like this is very helpful. > diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl > index 0959a782eccb..6630c76d92fd 100755 > --- a/gitweb/gitweb.perl > +++ b/gitweb/gitweb.perl > @@ -21,6 +21,7 @@ > use File::Basename qw(basename); > use Time::HiRes qw(gettimeofday tv_interval); > use Digest::MD5 qw(md5_hex); > +use Git::LoadCPAN::Mail::Address; I'll defer to others who are more familiar with gitweb and Perl ecosystem if this is warranted, but I have a feeling that importing and using Mail::Address->parse() only because we want to see if a given "<string>" is an address is a bit overkill and it might be sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i > + # Redact e-mail addresses. > + > + # To enable system wide have in $GITWEB_CONFIG > + # $feature{'email-privacy'}{'default'} = [1]; > + 'email-privacy' => { > + 'sub' => sub { feature_bool('email-privacy', @_) }, > + 'override' => 1, > + 'default' => [0]}, > ); > > sub gitweb_get_feature { > @@ -3449,6 +3459,32 @@ sub parse_date { > return %date; > } > > +sub is_mailaddr { > + my @addrs = Mail::Address->parse(shift); > + if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) { > + return 0; > + } > + return 1; > +} > + > +sub hide_mailaddrs_if_private { > + my $line = shift; > + return $line unless gitweb_check_feature('email-privacy'); > + while ($line =~ m/(<[^>]+>)/g) { > + my $match = $1; > + if (!is_mailaddr($match)) { > + next; > + } > + my $offset = pos $line; > + my $head = substr $line, 0, $offset - length($match); > + my $redaction = "<redacted>"; > + my $tail = substr $line, $offset; > + $line = $head . $redaction . $tail; > + pos $line = length($head) + length($redaction); Hmmmm, Perl suggestions from others? It looks quite strange to see that s/// operator is not used and replacement is done manually with byte position in a Perl script. > sub parse_tag { > my $tag_id = shift; > my %tag; > @@ -3465,7 +3501,7 @@ sub parse_tag { > } elsif ($line =~ m/^tag (.+)$/) { > $tag{'name'} = $1; > } elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) { > - $tag{'author'} = $1; > + $tag{'author'} = hide_mailaddrs_if_private($1); > $tag{'author_epoch'} = $2; > $tag{'author_tz'} = $3; > if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) { This (and the others that follow the same pattern) looks sensible. > @@ -7489,7 +7526,8 @@ sub git_log_generic { > -accesskey => "n", -title => "Alt-n"}, "next"); > } > my $patch_max = gitweb_get_feature('patches'); > - if ($patch_max && !defined $file_name) { > + if ($patch_max && !defined $file_name && > + !gitweb_check_feature('email-privacy')) { > if ($patch_max < 0 || @commitlist <= $patch_max) { > $paging_nav .= " ⋅ " . > $cgi->a({-href => href(action=>"patches", -replay=>1)}, > @@ -7550,7 +7588,8 @@ sub git_commit { > } @$parents ) . > ')'; > } > - if (gitweb_check_feature('patches') && @$parents <= 1) { > + if (gitweb_check_feature('patches') && @$parents <= 1 && > + !gitweb_check_feature('email-privacy')) { > $formats_nav .= " | " . > $cgi->a({-href => href(action=>"patch", -replay=>1)}, > "patch"); > @@ -7863,7 +7902,8 @@ sub git_commitdiff { > $formats_nav = > $cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)}, > "raw"); > - if ($patch_max && @{$co{'parents'}} <= 1) { > + if ($patch_max && @{$co{'parents'}} <= 1 && > + !gitweb_check_feature('email-privacy')) { > $formats_nav .= " | " . > $cgi->a({-href => href(action=>"patch", -replay=>1)}, > "patch"); I wouldn't have expected to see the above three hunks in the "patch" codepath. Rather, I was hoping that you'd do something like this at startup when you find out that the privacy feature is enabled: $feature{'patches'}{'default'} = [0] if gitweb_get_feature('email-privacy'); so that nothing related to the 'patches' has to be modified. That way, even if there were fourth place that can leak an e-mail address in the 'patch' codepath that above three hunks in this patch missed, crawlers won't be able to get at it, no? > diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh > index 1f32ca66ea51..77fc1298d4c6 100644 > --- a/t/lib-gitweb.sh > +++ b/t/lib-gitweb.sh > @@ -67,6 +67,9 @@ gitweb_run () { > GITWEB_CONFIG=$(pwd)/gitweb_config.perl > export GITWEB_CONFIG > > + PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN" > + export PERL5LIB > + Why is this change suddenly become necessary? This addition is only about tests---do we need to do something similar in the runtime environment when the updated gitweb that requires Mail::Address gets deployed, or is that covered already somewhere else? Thanks.