> > On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote: > >> From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> >> Gitweb extracts content from the Git log and makes it accessible >> over HTTP. As a result, e-mail addresses found in commits are >> exposed to web crawlers and they may not respect robots.txt. >> This may result in unsolicited messages. >> This is a feature for redacting e-mail addresses >> from the generated HTML, etc. content. >> >> This feature does not prevent someone from downloading the >> unredacted commit log, e.g., by cloning the repository, and >> extracting information from it. >> It aims to hinder the low-effort bulk collection of e-mail >> addresses by web crawlers. >> >> Changes since v1: >> - Turned off the feature by default. >> - Removed duplicate code. >> - Added note about Gitweb consumers receiving redacted logs. >> >> Changes since v2: >> - The feature can be set on a per-project basis. ('override' => 1) >> >> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> --- >> gitweb: Redacted e-mail addresses feature. >> >> Gitweb extracts content from the Git log and makes it accessible >> over >> HTTP. As a result, e-mail addresses found in commits are exposed to >> web >> crawlers. This may result in unsolicited messages. This is a feature >> for >> redacting e-mail addresses from the generated HTML content. >> >> This feature does not prevent someone from downloading the >> unredacted >> commit log and extracting information from it. It aims to hinder the >> low-effort bulk collection of e-mail addresses by web crawlers. >> >> Signed-off-by: Georgios Kontaxis geko1702+commits@xxxxxxxxx >> >> Published-As: >> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3 >> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git >> pr-910/kontaxis/kontaxis/email_privacy-v3 >> Pull-Request: https://github.com/gitgitgadget/git/pull/910 >> >> Range-diff vs v2: >> >> 1: 74af11ca8bf2 ! 1: 930cdefe7ee0 gitweb: redacted e-mail addresses >> feature. >> @@ Commit message >> - Removed duplicate code. >> - Added note about Gitweb consumers receiving redacted logs. >> >> + Changes since v2: >> + - The feature can be set on a per-project basis. ('override' >> => 1) >> + >> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> >> ## Documentation/gitweb.conf.txt ## >> @@ gitweb/gitweb.perl: sub evaluate_uri { >> + # $feature{'email_privacy'}{'default'} = [1]; >> + 'email_privacy' => { >> + 'sub' => sub { feature_bool('email_privacy', @_) }, >> -+ 'override' => 0, >> ++ 'override' => 1, >> + 'default' => [0]}, >> ); >> >> >> >> Documentation/gitweb.conf.txt | 16 +++++++++++++ >> gitweb/gitweb.perl | 42 ++++++++++++++++++++++++++++++++--- >> 2 files changed, 55 insertions(+), 3 deletions(-) >> >> diff --git a/Documentation/gitweb.conf.txt >> b/Documentation/gitweb.conf.txt >> index 7963a79ba98b..b7af3240177d 100644 >> --- a/Documentation/gitweb.conf.txt >> +++ b/Documentation/gitweb.conf.txt >> @@ -896,6 +896,22 @@ same as of the snippet above: >> It is an error to specify a ref that does not pass "git >> check-ref-format" >> scrutiny. Duplicated values are filtered. >> >> +email_privacy:: >> + Redact e-mail addresses from the generated HTML, etc. content. >> + This hides e-mail addresses found in the commit log from web >> crawlers. >> + Disabled by default. >> ++ >> +It is highly recommended to enable this feature unless web crawlers are >> +hindered in some other way. Note that crawlers intent on harvesting >> e-mail >> +addresses may disregard robots.txt. You can enable this feature like >> so: >> ++ >> +--------------------------------------------------------------------------- >> +$feature{'email_privacy'}{'default'} = [1]; >> +--------------------------------------------------------------------------- >> ++ >> +Note that if Gitweb is not the final step in a workflow then subsequent >> +steps may misbehave because of the redacted information they receive. >> + >> >> EXAMPLES >> -------- >> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >> index 0959a782eccb..174cc566d97d 100755 >> --- a/gitweb/gitweb.perl >> +++ b/gitweb/gitweb.perl >> @@ -569,6 +569,15 @@ sub evaluate_uri { >> 'sub' => \&feature_extra_branch_refs, >> 'override' => 0, >> 'default' => []}, >> + >> + # Redact e-mail addresses. >> + >> + # To enable system wide have in $GITWEB_CONFIG >> + # $feature{'email_privacy'}{'default'} = [1]; >> + 'email_privacy' => { >> + 'sub' => sub { feature_bool('email_privacy', @_) }, >> + 'override' => 1, >> + 'default' => [0]}, >> ); >> >> sub gitweb_get_feature { >> @@ -3449,6 +3458,19 @@ sub parse_date { >> return %date; >> } >> >> [snip] > > So in the v1 feedback I suggested: > > BEGIN QUOTE > sub maybe_hide_email { > my $email = shift; > return $email unless gitweb_check_feature('email_privacy'); > return hide_email($email); > } > > then: > > $tag{author_email} = maybe_hide_email($2); > END QUOTE > > But: > >> sub parse_tag { >> my $tag_id = shift; >> my %tag; >> @@ -3471,6 +3493,10 @@ sub parse_tag { >> if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) { >> $tag{'author_name'} = $1; >> $tag{'author_email'} = $2; >> + if (gitweb_check_feature('email_privacy')) { >> + $tag{'author_email'} = "private"; >> + $tag{'author'} = hide_mailaddr($tag{'author'}); >> + } > > This code seems quite awkward, we've already done the regex match, but > this code: > >> [snip] >> +sub hide_mailaddr_if_private { >> + my $line = shift; >> + return $line unless (gitweb_check_feature('email_privacy') && >> + $line =~ m/^([^<]+) <([^>]*)>/); >> + return hide_mailaddr($line) >> +} >> + >> +sub hide_mailaddr { >> + my $mailaddr = shift; >> + $mailaddr =~ s/<([^>]*)>/<private>/; >> + return $mailaddr; >> +} > > Is going to do it again incrementally, and then just act on a > search-replacement if we've got the feature enabled. > I think hide_mailaddr_if_private (which checks the feature flag and looks for the pattern) is only called when we haven't done so already. In every other case we call hide_mailaddr directly. So there's no duplication of checks. > It seems much simpler to just turn your: > >> + if (gitweb_check_feature('email_privacy')) { >> + $tag{'author_email'} = "private"; >> + $tag{'author'} = hide_mailaddr($tag{'author'}); >> + } > > Into: > > $tag{'author'} = maybe_hide_mailaddr($tag{author}, > \$tag{author_email}); > > Using: > > sub maybe_hide_email { > my ($email, $ref) = shift; > return $email unless gitweb_check_feature('email_privacy'); > $$ref = "private" if $ref; > return hide_email($email); > } > > Which also works for the case where you don't have a "private" hash key > to assign to. But maybe it overcomplicates things... > >> } else { >> $tag{'author_name'} = $tag{'author'}; >> } >> @@ -3519,6 +3545,10 @@ sub parse_commit_text { >> if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) { >> $co{'author_name'} = $1; >> $co{'author_email'} = $2; >> + if (gitweb_check_feature('email_privacy')) { >> + $co{'author_email'} = "private"; >> + $co{'author'} = hide_mailaddr($co{'author'}); >> + } >> } else { >> $co{'author_name'} = $co{'author'}; >> } >> @@ -3529,6 +3559,10 @@ sub parse_commit_text { >> if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) { >> $co{'committer_name'} = $1; >> $co{'committer_email'} = $2; >> + if (gitweb_check_feature('email_privacy')) { >> + $co{'committer_email'} = "private"; >> + $co{'committer'} = hide_mailaddr($co{'committer'}); >> + } >> [...] >> } else { >> $co{'committer_name'} = $co{'committer'}; >> } >> @@ -3568,9 +3602,10 @@ sub parse_commit_text { >> if (! defined $co{'title'} || $co{'title'} eq "") { >> $co{'title'} = $co{'title_short'} = '(no commit message)'; >> } >> - # remove added spaces >> + # remove added spaces, redact e-mail addresses if applicable. >> foreach my $line (@commit_lines) { >> $line =~ s/^ //; >> + $line = hide_mailaddr_if_private($line); >> } >> $co{'comment'} = \@commit_lines; >> >> @@ -8060,8 +8095,9 @@ sub git_commitdiff { >> close $fd >> or print "Reading git-diff-tree failed\n"; >> } elsif ($format eq 'patch') { >> - local $/ = undef; >> - print <$fd>; >> + while (my $line = <$fd>) { >> + print hide_mailaddr_if_private($line); >> + } > > Urm, have you tested this? How does a while loop over a <$fd> make sense > when $/ is undef, the readline() operator will always return just one > record, so having a while loop doesn't make sense. > $/ is not undef, the opposite, it's the default value which results in reading from <$fd> one line at a time. (where line is a sequence terminated by CRLF or LF) > I'm not sure of the input here, but given that if you're expecting to > replace all e-mail addresses on all lines with this function that's not > how it'll work, the s/// doesn't have a /g, so it'll stop at the first > replacement. > So we are reading from <$fd> one line at a time and looking for the first "expr1 <expr2>" occurrence in each line. This successfully redacts the author, committer, signed-off-by lines. Not doing a global replacement per line is intentional so as to avoid false positives. So far I haven't noticed missing any addresses that should have been redacted. >> close $fd >> or print "Reading git-format-patch failed\n"; >> } >> >> base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f > >