> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@xxxxxxxxx> > writes: > > [note to other reviewers. input from those who are more familiar > with gitweb and Perl is very much appreciated on this patch]. > >> From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> >> Gitweb extracts content from the Git log and makes it accessible >> over HTTP. As a result, e-mail addresses found in commits are >> exposed to web crawlers and they may not respect robots.txt. >> This may result in unsolicited messages. > > "... are exposed to web crawlers, which spammers may use." would be > sufficient as a problem description. > > After giving a problem description, it is customery to describe the > solution as if you are ordering the codebase to "be like so", so > instead of this ... > >> This is a feature for redacting e-mail addresses >> from the generated HTML, etc. content. > > ... we may say something like > > Introduce an 'email-privacy' feature, which defaults to false, > that redacts e-mail addresses that appear as author/committer > info and in log messages from the generated HTML content. > >> This feature does not prevent someone from downloading the >> unredacted commit log, e.g., by cloning the repository, and >> extracting information from it. >> It aims to hinder the low-effort bulk collection of e-mail >> addresses by web crawlers. > > And this is a good thing to add. Overall, nicely written. > >> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> --- >> gitweb: redacted e-mail addresses feature. >> >> Gitweb extracts content from the Git log and makes it accessible >> over >> HTTP. As a result, e-mail addresses found in commits are exposed to >> web >> crawlers and they may not respect robots.txt. This may result in >> unsolicited messages. This is a feature for redacting e-mail >> addresses >> from the generated HTML, etc. content. >> >> This feature does not prevent someone from downloading the >> unredacted >> commit log, e.g., by cloning the repository, and extracting >> information >> from it. It aims to hinder the low-effort bulk collection of e-mail >> addresses by web crawlers. > > You do not need to repeat the above, which is in the log message above. > >> Changes since v1: >> >> * Turned off the feature by default. >> * Removed duplicate code. >> * Added note about Gitweb consumers receiving redacted logs. >> >> Changes since v2: >> >> * The feature can be set on a per-project basis. ('override' => 1) >> >> Changes since v3: >> >> * Renamed feature to "email-privacy" and improved documentation. >> * Removed UI elements for git-format-patch since it won't be >> redacted. >> * Simplified calls to the address redaction logic. >> * Mail::Address is now used to reduce false-positive redactions. > > Having these under the --- line like this is very helpful. > >> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >> index 0959a782eccb..6630c76d92fd 100755 >> --- a/gitweb/gitweb.perl >> +++ b/gitweb/gitweb.perl >> @@ -21,6 +21,7 @@ >> use File::Basename qw(basename); >> use Time::HiRes qw(gettimeofday tv_interval); >> use Digest::MD5 qw(md5_hex); >> +use Git::LoadCPAN::Mail::Address; > > I'll defer to others who are more familiar with gitweb and Perl > ecosystem if this is warranted, but I have a feeling that importing > and using Mail::Address->parse() only because we want to see if a > given "<string>" is an address is a bit overkill and it might be > sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i > >> + # Redact e-mail addresses. >> + >> + # To enable system wide have in $GITWEB_CONFIG >> + # $feature{'email-privacy'}{'default'} = [1]; >> + 'email-privacy' => { >> + 'sub' => sub { feature_bool('email-privacy', @_) }, >> + 'override' => 1, >> + 'default' => [0]}, >> ); >> >> sub gitweb_get_feature { >> @@ -3449,6 +3459,32 @@ sub parse_date { >> return %date; >> } >> >> +sub is_mailaddr { >> + my @addrs = Mail::Address->parse(shift); >> + if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) { >> + return 0; >> + } >> + return 1; >> +} >> + >> +sub hide_mailaddrs_if_private { >> + my $line = shift; >> + return $line unless gitweb_check_feature('email-privacy'); >> + while ($line =~ m/(<[^>]+>)/g) { >> + my $match = $1; >> + if (!is_mailaddr($match)) { >> + next; >> + } >> + my $offset = pos $line; >> + my $head = substr $line, 0, $offset - length($match); >> + my $redaction = "<redacted>"; >> + my $tail = substr $line, $offset; >> + $line = $head . $redaction . $tail; >> + pos $line = length($head) + length($redaction); > > Hmmmm, Perl suggestions from others? It looks quite strange to see > that s/// operator is not used and replacement is done manually with > byte position in a Perl script. > >> sub parse_tag { >> my $tag_id = shift; >> my %tag; >> @@ -3465,7 +3501,7 @@ sub parse_tag { >> } elsif ($line =~ m/^tag (.+)$/) { >> $tag{'name'} = $1; >> } elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) { >> - $tag{'author'} = $1; >> + $tag{'author'} = hide_mailaddrs_if_private($1); >> $tag{'author_epoch'} = $2; >> $tag{'author_tz'} = $3; >> if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) { > > This (and the others that follow the same pattern) looks sensible. > >> @@ -7489,7 +7526,8 @@ sub git_log_generic { >> -accesskey => "n", -title => "Alt-n"}, "next"); >> } >> my $patch_max = gitweb_get_feature('patches'); >> - if ($patch_max && !defined $file_name) { >> + if ($patch_max && !defined $file_name && >> + !gitweb_check_feature('email-privacy')) { >> if ($patch_max < 0 || @commitlist <= $patch_max) { >> $paging_nav .= " ⋅ " . >> $cgi->a({-href => href(action=>"patches", -replay=>1)}, >> @@ -7550,7 +7588,8 @@ sub git_commit { >> } @$parents ) . >> ')'; >> } >> - if (gitweb_check_feature('patches') && @$parents <= 1) { >> + if (gitweb_check_feature('patches') && @$parents <= 1 && >> + !gitweb_check_feature('email-privacy')) { >> $formats_nav .= " | " . >> $cgi->a({-href => href(action=>"patch", -replay=>1)}, >> "patch"); >> @@ -7863,7 +7902,8 @@ sub git_commitdiff { >> $formats_nav = >> $cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)}, >> "raw"); >> - if ($patch_max && @{$co{'parents'}} <= 1) { >> + if ($patch_max && @{$co{'parents'}} <= 1 && >> + !gitweb_check_feature('email-privacy')) { >> $formats_nav .= " | " . >> $cgi->a({-href => href(action=>"patch", -replay=>1)}, >> "patch"); > > I wouldn't have expected to see the above three hunks in the > "patch" codepath. Rather, I was hoping that you'd do something > like this at startup when you find out that the privacy feature > is enabled: > > $feature{'patches'}{'default'} = [0] > if gitweb_get_feature('email-privacy'); > > so that nothing related to the 'patches' has to be modified. > That way, even if there were fourth place that can leak an e-mail > address in the 'patch' codepath that above three hunks in this patch > missed, crawlers won't be able to get at it, no? > I forgot to mention one of the reasons for doing it this way. If someone knows this URL exists and is relying on it (e.g., in a script) they can still call it with "email-privacy" enabled. The only thing that's changing is that the UI element (link) that a web crawler would follow just because it's there is now gone. But, as mentioned before, I don't feel strongly about either approach. Let me know. >> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh >> index 1f32ca66ea51..77fc1298d4c6 100644 >> --- a/t/lib-gitweb.sh >> +++ b/t/lib-gitweb.sh >> @@ -67,6 +67,9 @@ gitweb_run () { >> GITWEB_CONFIG=$(pwd)/gitweb_config.perl >> export GITWEB_CONFIG >> >> + PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN" >> + export PERL5LIB >> + > > Why is this change suddenly become necessary? This addition is only > about tests---do we need to do something similar in the runtime > environment when the updated gitweb that requires Mail::Address gets > deployed, or is that covered already somewhere else? > > Thanks. >