On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote: > From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> > > Gitweb extracts content from the Git log and makes it accessible > over HTTP. As a result, e-mail addresses found in commits are > exposed to web crawlers and they may not respect robots.txt. > This may result in unsolicited messages. > This is a feature for redacting e-mail addresses > from the generated HTML, etc. content. > > This feature does not prevent someone from downloading the > unredacted commit log, e.g., by cloning the repository, and > extracting information from it. > It aims to hinder the low-effort bulk collection of e-mail > addresses by web crawlers. > > Changes since v1: > - Turned off the feature by default. > - Removed duplicate code. > - Added note about Gitweb consumers receiving redacted logs. > > Changes since v2: > - The feature can be set on a per-project basis. ('override' => 1) > > Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> > --- > gitweb: Redacted e-mail addresses feature. > > Gitweb extracts content from the Git log and makes it accessible over > HTTP. As a result, e-mail addresses found in commits are exposed to web > crawlers. This may result in unsolicited messages. This is a feature for > redacting e-mail addresses from the generated HTML content. > > This feature does not prevent someone from downloading the unredacted > commit log and extracting information from it. It aims to hinder the > low-effort bulk collection of e-mail addresses by web crawlers. > > Signed-off-by: Georgios Kontaxis geko1702+commits@xxxxxxxxx > > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3 > Pull-Request: https://github.com/gitgitgadget/git/pull/910 > > Range-diff vs v2: > > 1: 74af11ca8bf2 ! 1: 930cdefe7ee0 gitweb: redacted e-mail addresses feature. > @@ Commit message > - Removed duplicate code. > - Added note about Gitweb consumers receiving redacted logs. > > + Changes since v2: > + - The feature can be set on a per-project basis. ('override' => 1) > + > Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> > > ## Documentation/gitweb.conf.txt ## > @@ gitweb/gitweb.perl: sub evaluate_uri { > + # $feature{'email_privacy'}{'default'} = [1]; > + 'email_privacy' => { > + 'sub' => sub { feature_bool('email_privacy', @_) }, > -+ 'override' => 0, > ++ 'override' => 1, > + 'default' => [0]}, > ); > > > > Documentation/gitweb.conf.txt | 16 +++++++++++++ > gitweb/gitweb.perl | 42 ++++++++++++++++++++++++++++++++--- > 2 files changed, 55 insertions(+), 3 deletions(-) > > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt > index 7963a79ba98b..b7af3240177d 100644 > --- a/Documentation/gitweb.conf.txt > +++ b/Documentation/gitweb.conf.txt > @@ -896,6 +896,22 @@ same as of the snippet above: > It is an error to specify a ref that does not pass "git check-ref-format" > scrutiny. Duplicated values are filtered. > > +email_privacy:: > + Redact e-mail addresses from the generated HTML, etc. content. > + This hides e-mail addresses found in the commit log from web crawlers. > + Disabled by default. > ++ > +It is highly recommended to enable this feature unless web crawlers are > +hindered in some other way. Note that crawlers intent on harvesting e-mail > +addresses may disregard robots.txt. You can enable this feature like so: > ++ > +--------------------------------------------------------------------------- > +$feature{'email_privacy'}{'default'} = [1]; > +--------------------------------------------------------------------------- > ++ > +Note that if Gitweb is not the final step in a workflow then subsequent > +steps may misbehave because of the redacted information they receive. > + > > EXAMPLES > -------- > diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl > index 0959a782eccb..174cc566d97d 100755 > --- a/gitweb/gitweb.perl > +++ b/gitweb/gitweb.perl > @@ -569,6 +569,15 @@ sub evaluate_uri { > 'sub' => \&feature_extra_branch_refs, > 'override' => 0, > 'default' => []}, > + > + # Redact e-mail addresses. > + > + # To enable system wide have in $GITWEB_CONFIG > + # $feature{'email_privacy'}{'default'} = [1]; > + 'email_privacy' => { > + 'sub' => sub { feature_bool('email_privacy', @_) }, > + 'override' => 1, > + 'default' => [0]}, > ); > > sub gitweb_get_feature { > @@ -3449,6 +3458,19 @@ sub parse_date { > return %date; > } > > [snip] So in the v1 feedback I suggested: BEGIN QUOTE sub maybe_hide_email { my $email = shift; return $email unless gitweb_check_feature('email_privacy'); return hide_email($email); } then: $tag{author_email} = maybe_hide_email($2); END QUOTE But: > sub parse_tag { > my $tag_id = shift; > my %tag; > @@ -3471,6 +3493,10 @@ sub parse_tag { > if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) { > $tag{'author_name'} = $1; > $tag{'author_email'} = $2; > + if (gitweb_check_feature('email_privacy')) { > + $tag{'author_email'} = "private"; > + $tag{'author'} = hide_mailaddr($tag{'author'}); > + } This code seems quite awkward, we've already done the regex match, but this code: > [snip] > +sub hide_mailaddr_if_private { > + my $line = shift; > + return $line unless (gitweb_check_feature('email_privacy') && > + $line =~ m/^([^<]+) <([^>]*)>/); > + return hide_mailaddr($line) > +} > + > +sub hide_mailaddr { > + my $mailaddr = shift; > + $mailaddr =~ s/<([^>]*)>/<private>/; > + return $mailaddr; > +} Is going to do it again incrementally, and then just act on a search-replacement if we've got the feature enabled. It seems much simpler to just turn your: > + if (gitweb_check_feature('email_privacy')) { > + $tag{'author_email'} = "private"; > + $tag{'author'} = hide_mailaddr($tag{'author'}); > + } Into: $tag{'author'} = maybe_hide_mailaddr($tag{author}, \$tag{author_email}); Using: sub maybe_hide_email { my ($email, $ref) = shift; return $email unless gitweb_check_feature('email_privacy'); $$ref = "private" if $ref; return hide_email($email); } Which also works for the case where you don't have a "private" hash key to assign to. But maybe it overcomplicates things... > } else { > $tag{'author_name'} = $tag{'author'}; > } > @@ -3519,6 +3545,10 @@ sub parse_commit_text { > if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) { > $co{'author_name'} = $1; > $co{'author_email'} = $2; > + if (gitweb_check_feature('email_privacy')) { > + $co{'author_email'} = "private"; > + $co{'author'} = hide_mailaddr($co{'author'}); > + } > } else { > $co{'author_name'} = $co{'author'}; > } > @@ -3529,6 +3559,10 @@ sub parse_commit_text { > if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) { > $co{'committer_name'} = $1; > $co{'committer_email'} = $2; > + if (gitweb_check_feature('email_privacy')) { > + $co{'committer_email'} = "private"; > + $co{'committer'} = hide_mailaddr($co{'committer'}); > + } > [...] > } else { > $co{'committer_name'} = $co{'committer'}; > } > @@ -3568,9 +3602,10 @@ sub parse_commit_text { > if (! defined $co{'title'} || $co{'title'} eq "") { > $co{'title'} = $co{'title_short'} = '(no commit message)'; > } > - # remove added spaces > + # remove added spaces, redact e-mail addresses if applicable. > foreach my $line (@commit_lines) { > $line =~ s/^ //; > + $line = hide_mailaddr_if_private($line); > } > $co{'comment'} = \@commit_lines; > > @@ -8060,8 +8095,9 @@ sub git_commitdiff { > close $fd > or print "Reading git-diff-tree failed\n"; > } elsif ($format eq 'patch') { > - local $/ = undef; > - print <$fd>; > + while (my $line = <$fd>) { > + print hide_mailaddr_if_private($line); > + } Urm, have you tested this? How does a while loop over a <$fd> make sense when $/ is undef, the readline() operator will always return just one record, so having a while loop doesn't make sense. I'm not sure of the input here, but given that if you're expecting to replace all e-mail addresses on all lines with this function that's not how it'll work, the s/// doesn't have a /g, so it'll stop at the first replacement. > close $fd > or print "Reading git-format-patch failed\n"; > } > > base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f