> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@xxxxxxxxx> > writes: > >> From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> >> Gitweb extracts content from the Git log and makes it accessible >> over HTTP. As a result, e-mail addresses found in commits are >> exposed to web crawlers and they may not respect robots.txt. >> This may result in unsolicited messages. >> This is a feature for redacting e-mail addresses >> from the generated HTML, etc. content. >> >> This feature does not prevent someone from downloading the >> unredacted commit log, e.g., by cloning the repository, and >> extracting information from it. >> It aims to hinder the low-effort bulk collection of e-mail >> addresses by web crawlers. > > Lines from here ... > >> >> Changes since v1: >> - Turned off the feature by default. >> - Removed duplicate code. >> - Added note about Gitweb consumers receiving redacted logs. >> >> Changes since v2: >> - The feature can be set on a per-project basis. ('override' => 1) > > ... to here are to help reviewers on the mailing list while > iterations of the same change is being reviewed and polished. > > But it is useless noise for those who only read "git log". They > simply do not have access to earlier iterations. > > Such "changes between iterations" comments needs to be written after > the three-dash lines. > >> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> --- >> gitweb: Redacted e-mail addresses feature. >> >> Gitweb extracts content from the Git log and makes it accessible >> over >> HTTP. As a result, e-mail addresses found in commits are exposed to >> web >> crawlers. This may result in unsolicited messages. This is a feature >> for >> redacting e-mail addresses from the generated HTML content. >> >> This feature does not prevent someone from downloading the >> unredacted >> commit log and extracting information from it. It aims to hinder the >> low-effort bulk collection of e-mail addresses by web crawlers. >> >> Signed-off-by: Georgios Kontaxis geko1702+commits@xxxxxxxxx >> >> Published-As: >> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3 >> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git >> pr-910/kontaxis/kontaxis/email_privacy-v3 >> Pull-Request: https://github.com/gitgitgadget/git/pull/910 >> >> Range-diff vs v2: >> >> 1: 74af11ca8bf2 ! 1: 930cdefe7ee0 gitweb: redacted e-mail addresses >> feature. >> @@ Commit message >> - Removed duplicate code. >> - Added note about Gitweb consumers receiving redacted logs. >> >> + Changes since v2: >> + - The feature can be set on a per-project basis. ('override' >> => 1) >> + >> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx> >> >> ## Documentation/gitweb.conf.txt ## >> @@ gitweb/gitweb.perl: sub evaluate_uri { >> + # $feature{'email_privacy'}{'default'} = [1]; >> + 'email_privacy' => { >> + 'sub' => sub { feature_bool('email_privacy', @_) }, >> -+ 'override' => 0, >> ++ 'override' => 1, >> + 'default' => [0]}, >> ); >> >> >> >> Documentation/gitweb.conf.txt | 16 +++++++++++++ >> gitweb/gitweb.perl | 42 ++++++++++++++++++++++++++++++++--- >> 2 files changed, 55 insertions(+), 3 deletions(-) >> >> diff --git a/Documentation/gitweb.conf.txt >> b/Documentation/gitweb.conf.txt >> index 7963a79ba98b..b7af3240177d 100644 >> --- a/Documentation/gitweb.conf.txt >> +++ b/Documentation/gitweb.conf.txt >> @@ -896,6 +896,22 @@ same as of the snippet above: >> It is an error to specify a ref that does not pass "git >> check-ref-format" >> scrutiny. Duplicated values are filtered. >> >> +email_privacy:: >> + Redact e-mail addresses from the generated HTML, etc. content. >> + This hides e-mail addresses found in the commit log from web >> crawlers. >> + Disabled by default. >> ++ >> +It is highly recommended to enable this feature unless web crawlers are >> +hindered in some other way. Note that crawlers intent on harvesting >> e-mail >> +addresses may disregard robots.txt. > > Up to this line is more-or-less OK, but with a few comments: > > - "This hides ... from web crawlers"? Doesn't this hide it > indiscriminately, whether the requester is an interactive > end-user or a crawler? > Correct. It doesn't say "it only hides from". I can see how it may be confusing. I'll rephrase. > - The name of the configured feature, 'email_privacy'. There is > another existing feature with underscore in its name > (remote_heads), but I think it is an odd-ball mistake we do not > want to imitate. Instead, I think the "extra branch refs" may be > a good example to follow. The feature name (written in Perl > code) is "extra-branch-refs" (downcased words with inter-word > dashes) and the corresponding configuration (I am not saying we > should add one to support this feature, if we do not have one > already) is "gitweb.extraBranchRefs" (camelCased words). > I'll rename the feature. > - Do not exaggerate with words like "highly", but trust the > intelligence of your readers to make the right decision when they > understand the reason why this feature exists in the first place. > Will remove. > - I do not think this entry should be added to the end of feature > list. How about listing it just after 'avatar', which is also > about how author/committer identity is shown? > Will do. >> +--------------------------------------------------------------------------- >> +$feature{'email_privacy'}{'default'} = [1]; >> +--------------------------------------------------------------------------- > > I do not think this should be here. None of the boolean features > listed early in "Features in `%feature`" section like blame, > snapshot, grep, pickaxe, ... features tell readers how to enable a > specific feature like that. > > Unless the syntax to configure a feature is one-off oddball that is > different from all other features, we shouldn't clutter the > description with an example like this. > > At the beginning of the section "Configuring Gitweb Features" there > is a general description and that should be sufficient (and if it is > not sufficient, it is OK to add an example to that section). > Will remove. >> +Note that if Gitweb is not the final step in a workflow then subsequent >> +steps may misbehave because of the redacted information they receive. > > In other words, this will break crawlers that expect email addresses > are there and want to use it for some good purpose. But that is an > natural consequence of the "feature", and should be described as > such when we said "Redact e-mail addresses", not as a "by the way" > mention in a footnote. > > ... after reading more, readers realize that the damage is > far worse in the current incarnation of the patch, but let's > read on ... > Will move this out of the footnote. >> @@ -3449,6 +3458,19 @@ sub parse_date { >> return %date; >> } >> >> +sub hide_mailaddr_if_private { >> + my $line = shift; >> + return $line unless (gitweb_check_feature('email_privacy') && >> + $line =~ m/^([^<]+) <([^>]*)>/); > > I find that the second line is way too deeply indented. Wouldn't > >> + return $line unless (gitweb_check_feature('email_privacy') && >> + $line =~ m/^([^<]+) <([^>]*)>/); > > be enough? > Will adjust. >> + return hide_mailaddr($line) >> +} >> + >> +sub hide_mailaddr { >> + my $mailaddr = shift; >> + $mailaddr =~ s/<([^>]*)>/<private>/; > > s/private/redacted/ perhaps? > Will adjust. >> @@ -3568,9 +3602,10 @@ sub parse_commit_text { >> if (! defined $co{'title'} || $co{'title'} eq "") { >> $co{'title'} = $co{'title_short'} = '(no commit message)'; >> } >> - # remove added spaces >> + # remove added spaces, redact e-mail addresses if applicable. >> foreach my $line (@commit_lines) { >> $line =~ s/^ //; >> + $line = hide_mailaddr_if_private($line); > > A commit log that ends with > > Thank you, A <a@xxxxxxxxxxx> and B <b@xxxxxxxxxxx>, for discovering > this bug and quickly proposing a solution. > > would recact only the first one but not the other, I suspect. > > Much more problematic is that I see too many hits from: > > $ git log v2.30.0..v2.31.0 | grep ' <[^>@]*>' > > That is, "find a line in the commit log message that has a SP, open <bra, > run of characters that are not ket> or at@sign, closed with ket>". These > are clearly not e-mail addresses. > > This "feature" butchers a commit title: > > mktag doc: say <hash> not <sha1> > > to > > mktag doc: say <private> not <sha1> > > doesn't it? > True. Thanks for pointing these out. Sounds like the pattern we're looking for needs to be more specific. >> @@ -8060,8 +8095,9 @@ sub git_commitdiff { >> close $fd >> or print "Reading git-diff-tree failed\n"; >> } elsif ($format eq 'patch') { >> - local $/ = undef; >> - print <$fd>; >> + while (my $line = <$fd>) { >> + print hide_mailaddr_if_private($line); >> + } > > And this is even worse. > > $ git log -p --no-merges v2.30.0..v2.31.0 | grep ' <[^>@]*>' | wc -l > > gives me ~4700 hits. Because the patch text as well as the log > message is munged, the "feature" makes the output utterly unreliable > as an input to "git am" by munging too much. Interesting examples > are like these: > > - for (i = 0; i < pairs->nr; ++i) { > 'git <command> [<revision>...] -- [<file>...]' > + for (i = split; i < geometry->pack_nr; i++) { > > Now, I am *NOT* saying that you should tighten the e-mail address > syntax detection and keep munging the patch text. The lines that > begin with minus and SP in a patch must match the preimage text, > so munging out existing e-mail addresses from them will make the > patch fail to find the part that needs to be modified. And > replacing an e-mail address on a line that begins with plus would > redact it from the source---if they wrote an address, they must have > meant it to be available to those who consume the source, so I doubt > the wisdom of munging the patch part at all. > > I may be sympathetic to the cause of the patch, but, I do not agree > with its execution in this iteration of the patch. > I see your point. It seems hiding e-mail addresses should be limited to the commit message, i.e., stop at the "---" line. > Thanks. > >