Re: [PATCH v4] gitweb: redacted e-mail addresses feature.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@xxxxxxxxx>
> writes:
>
> [note to other reviewers. input from those who are more familiar
> with gitweb and Perl is very much appreciated on this patch].
>
>> From: Georgios Kontaxis <geko1702+commits@xxxxxxxxx>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>
> "... are exposed to web crawlers, which spammers may use." would be
> sufficient as a problem description.
>
> After giving a problem description, it is customery to describe the
> solution as if you are ordering the codebase to "be like so", so
> instead of this ...
>
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>
> ... we may say something like
>
>     Introduce an 'email-privacy' feature, which defaults to false,
>     that redacts e-mail addresses that appear as author/committer
>     info and in log messages from the generated HTML content.
>
>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> And this is a good thing to add.  Overall, nicely written.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@xxxxxxxxx>
>> ---
>>     gitweb: redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers and they may not respect robots.txt. This may result in
>>     unsolicited messages. This is a feature for redacting e-mail
>> addresses
>>     from the generated HTML, etc. content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log, e.g., by cloning the repository, and extracting
>> information
>>     from it. It aims to hinder the low-effort bulk collection of e-mail
>>     addresses by web crawlers.
>
> You do not need to repeat the above, which is in the log message above.
>
>>     Changes since v1:
>>
>>      * Turned off the feature by default.
>>      * Removed duplicate code.
>>      * Added note about Gitweb consumers receiving redacted logs.
>>
>>     Changes since v2:
>>
>>      * The feature can be set on a per-project basis. ('override' => 1)
>>
>>     Changes since v3:
>>
>>      * Renamed feature to "email-privacy" and improved documentation.
>>      * Removed UI elements for git-format-patch since it won't be
>> redacted.
>>      * Simplified calls to the address redaction logic.
>>      * Mail::Address is now used to reduce false-positive redactions.
>
> Having these under the --- line like this is very helpful.
>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..6630c76d92fd 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -21,6 +21,7 @@
>>  use File::Basename qw(basename);
>>  use Time::HiRes qw(gettimeofday tv_interval);
>>  use Digest::MD5 qw(md5_hex);
>> +use Git::LoadCPAN::Mail::Address;
>
> I'll defer to others who are more familiar with gitweb and Perl
> ecosystem if this is warranted, but I have a feeling that importing
> and using Mail::Address->parse() only because we want to see if a
> given "<string>" is an address is a bit overkill and it might be
> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>
>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email-privacy'}{'default'} = [1];
>> +	'email-privacy' => {
>> +		'sub' => sub { feature_bool('email-privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3459,32 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub is_mailaddr {
>> +	my @addrs = Mail::Address->parse(shift);
>> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
>> +		return 0;
>> +	}
>> +	return 1;
>> +}
>> +
>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	while ($line =~ m/(<[^>]+>)/g) {
>> +		my $match = $1;
>> +		if (!is_mailaddr($match)) {
>> +			next;
>> +		}
>> +		my $offset = pos $line;
>> +		my $head = substr $line, 0, $offset - length($match);
>> +		my $redaction = "<redacted>";
>> +		my $tail = substr $line, $offset;
>> +		$line = $head . $redaction . $tail;
>> +		pos $line = length($head) + length($redaction);
>
> Hmmmm, Perl suggestions from others?  It looks quite strange to see
> that s/// operator is not used and replacement is done manually with
> byte position in a Perl script.
>
>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3465,7 +3501,7 @@ sub parse_tag {
>>  		} elsif ($line =~ m/^tag (.+)$/) {
>>  			$tag{'name'} = $1;
>>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
>> -			$tag{'author'} = $1;
>> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>>  			$tag{'author_epoch'} = $2;
>>  			$tag{'author_tz'} = $3;
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>
> This (and the others that follow the same pattern) looks sensible.
>
>> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>>  	}
>>  	my $patch_max = gitweb_get_feature('patches');
>> -	if ($patch_max && !defined $file_name) {
>> +	if ($patch_max && !defined $file_name &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>>  			$paging_nav .= " &sdot; " .
>>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
>> @@ -7550,7 +7588,8 @@ sub git_commit {
>>  			} @$parents ) .
>>  			')';
>>  	}
>> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
>> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		$formats_nav .= " | " .
>>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  				"patch");
>> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>>  		$formats_nav =
>>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>>  			        "raw");
>> -		if ($patch_max && @{$co{'parents'}} <= 1) {
>> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
>> +			!gitweb_check_feature('email-privacy')) {
>>  			$formats_nav .= " | " .
>>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  					"patch");
>
> I wouldn't have expected to see the above three hunks in the
> "patch" codepath.  Rather, I was hoping that you'd do something
> like this at startup when you find out that the privacy feature
> is enabled:
>
> 	$feature{'patches'}{'default'} = [0]
> 		if gitweb_get_feature('email-privacy');
>
> so that nothing related to the 'patches' has to be modified.
> That way, even if there were fourth place that can leak an e-mail
> address in the 'patch' codepath that above three hunks in this patch
> missed, crawlers won't be able to get at it, no?
>
I forgot to mention one of the reasons for doing it this way.
If someone knows this URL exists and is relying on it (e.g., in a script)
they can still call it with "email-privacy" enabled.
The only thing that's changing is that the UI element (link) that a web
crawler would follow just because it's there is now gone.

But, as mentioned before, I don't feel strongly about either approach.
Let me know.

>> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
>> index 1f32ca66ea51..77fc1298d4c6 100644
>> --- a/t/lib-gitweb.sh
>> +++ b/t/lib-gitweb.sh
>> @@ -67,6 +67,9 @@ gitweb_run () {
>>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>>  	export GITWEB_CONFIG
>>
>> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
>> +	export PERL5LIB
>> +
>
> Why is this change suddenly become necessary?  This addition is only
> about tests---do we need to do something similar in the runtime
> environment when the updated gitweb that requires Mail::Address gets
> deployed, or is that covered already somewhere else?
>
> Thanks.
>





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux