Re: [PATCH] gitweb: handle non UTF-8 text

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Alexandre, I hope that the patch attached would solve your problem.

Junio C Hamano wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> writes:
> 
>> On Tue, 29 May 2007, Martin Koegler wrote:
>> ...
>>> But I agree, that there should be the possibilty to choose the
>>> fallback encoding.
>>
>> I think for the beginning it would be enough to have
>>
>>   # assume this charset if line contains non-UTF-8 characters
>>   our $fallback_encoding = "latin1";

Added this, with more elaborate comment, just before %feature hash.

>> or something like that (perhaps different wording in the comment,
>> perhaps different name of the variable) in the gitweb.perl for your
>> idea to be accepted.
>>
>> That, and using to_utf8 (as before e3ad95a8) and not my_decode_utf8
>> as subroutine name. If only it would be possible to avoid I think
>> quote costly "eval {....}" invocation...
> 
> Except that I had an impression that block form of "eval" (as
> opposed to "parse and evaluate string" kind) was not costly at
> all.

I have checked the time it took to run the gitweb test (t9500),
and the time to run (user+sys) increased about 1% after this patch,
which is even within range of error I think.
 
> Please make it so.

I have changed the name from my_decode_utf8 to the name used for thin
wrapper before commit e3ad95a8 "gitweb: use decode_utf8 directly", namely
to_utf8, and put it in the place where old to_utf8 subroutine was.

Instead of bit hackish "return $res || decode('latin1', $str);" use
"if (defined $res) { ... } else { ... }"; it avoids calling decode()
unnecessary for '' and '0' strings, which are also false, but do not mean
that decode_utf8 failed.

It uses explicit constant names, Encode::FB_CROAK instead of 1, and
Encode::FB_DEFAULT instead of default undef/0.

It adds very, very basic test: it does check _only_ if there are any
errors or warning which would go to web server log; it does not check
if the output is correct. It uses helper files from other i18n tests.

Added comments.


Still the main change is by Martin Koegler and he should be author of
this commit, I think. I have added S-o-b: by me.

-- >8 --
From: Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx>
Subject: [PATCH] gitweb: Handle non UTF-8 text better

gitweb assumes that everything is in UTF-8. If a text contains invalid
UTF-8 character sequences, the text must be in a different encoding.

This commit introduces $fallback_encoding which would be used as input
encoding if gitweb encounters text with is not valid UTF-8.

Add basic test for this in t/t9500-gitweb-standalone-no-errors.sh

Signed-off-by: Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx>
Signed-off-by: Jakub Narebski <jnareb@xxxxxxxxx>
---
 gitweb/gitweb.perl                     |   41 ++++++++++++++++++++++++-------
 t/t9500-gitweb-standalone-no-errors.sh |   28 +++++++++++++++++++++
 2 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index c3921cb..e92596c 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -94,6 +94,13 @@ our $default_text_plain_charset  = undef;
 # (relative to the current git repository)
 our $mimetypes_file = undef;
 
+# assume this charset if line contains non-UTF-8 characters;
+# it should be valid encoding (see Encoding::Supported(3pm) for list),
+# for which encoding all byte sequences are valid, for example
+# 'iso-8859-1' aka 'latin1' (it is decoded without checking, so it
+# could be even 'utf-8' for the old behavior)
+our $fallback_encoding = 'latin1';
+
 # You define site-wide feature defaults here; override them with
 # $GITWEB_CONFIG as necessary.
 our %feature = (
@@ -602,6 +609,20 @@ sub validate_refname {
 	return $input;
 }
 
+# decode sequences of octets in utf8 into Perl's internal form,
+# which is utf-8 with utf8 flag set if needed.  gitweb writes out
+# in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
+sub to_utf8 {
+	my $str = shift;
+	my $res;
+	eval { $res = decode_utf8($str, Encode::FB_CROAK); };
+	if (defined $res) {
+		return $res;
+	} else {
+		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
+	}
+}
+
 # quote unsafe chars, but keep the slash, even when it's not
 # correct, but quoted slashes look too horrible in bookmarks
 sub esc_param {
@@ -626,7 +647,7 @@ sub esc_html ($;%) {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = to_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -640,7 +661,7 @@ sub esc_path {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = to_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -925,7 +946,7 @@ sub format_subject_html {
 
 	if (length($short) < length($long)) {
 		return $cgi->a({-href => $href, -class => "list subject",
-		                -title => decode_utf8($long)},
+		                -title => to_utf8($long)},
 		       esc_html($short) . $extra);
 	} else {
 		return $cgi->a({-href => $href, -class => "list subject"},
@@ -1239,7 +1260,7 @@ sub git_get_projects_list {
 			if (check_export_ok("$projectroot/$path")) {
 				my $pr = {
 					path => $path,
-					owner => decode_utf8($owner),
+					owner => to_utf8($owner),
 				};
 				push @list, $pr;
 				(my $forks_path = $path) =~ s/\.git$//;
@@ -1269,7 +1290,7 @@ sub git_get_project_owner {
 			$pr = unescape($pr);
 			$ow = unescape($ow);
 			if ($pr eq $project) {
-				$owner = decode_utf8($ow);
+				$owner = to_utf8($ow);
 				last;
 			}
 		}
@@ -1759,7 +1780,7 @@ sub get_file_owner {
 	}
 	my $owner = $gcos;
 	$owner =~ s/[,;].*$//;
-	return decode_utf8($owner);
+	return to_utf8($owner);
 }
 
 ## ......................................................................
@@ -1842,7 +1863,7 @@ sub git_header_html {
 
 	my $title = "$site_name";
 	if (defined $project) {
-		$title .= " - " . decode_utf8($project);
+		$title .= " - " . to_utf8($project);
 		if (defined $action) {
 			$title .= "/$action";
 			if (defined $file_name) {
@@ -2116,7 +2137,7 @@ sub git_print_page_path {
 
 	print "<div class=\"page_path\">";
 	print $cgi->a({-href => href(action=>"tree", hash_base=>$hb),
-	              -title => 'tree root'}, decode_utf8("[$project]"));
+	              -title => 'tree root'}, to_utf8("[$project]"));
 	print " / ";
 	if (defined $name) {
 		my @dirname = split '/', $name;
@@ -2936,7 +2957,7 @@ sub git_project_list_body {
 		($pr->{'age'}, $pr->{'age_string'}) = @aa;
 		if (!defined $pr->{'descr'}) {
 			my $descr = git_get_project_description($pr->{'path'}) || "";
-			$pr->{'descr_long'} = decode_utf8($descr);
+			$pr->{'descr_long'} = to_utf8($descr);
 			$pr->{'descr'} = chop_str($descr, 25, 5);
 		}
 		if (!defined $pr->{'owner'}) {
@@ -3981,7 +4002,7 @@ sub git_snapshot {
 	my $git = git_cmd_str();
 	my $name = $project;
 	$name =~ s/\047/\047\\\047\047/g;
-	my $filename = decode_utf8(basename($project));
+	my $filename = to_utf8(basename($project));
 	my $cmd;
 	if ($suffix eq 'zip') {
 		$filename .= "-$hash.$suffix";
diff --git a/t/t9500-gitweb-standalone-no-errors.sh b/t/t9500-gitweb-standalone-no-errors.sh
index b92ab63..44ae503 100755
--- a/t/t9500-gitweb-standalone-no-errors.sh
+++ b/t/t9500-gitweb-standalone-no-errors.sh
@@ -487,4 +487,32 @@ test_expect_success \
 	'gitweb_run "p=.git;a=atom"'
 test_debug 'cat gitweb.log'
 
+# ----------------------------------------------------------------------
+# encoding/decoding
+
+test_expect_success \
+	'encode(commit): utf8' \
+	'. ../t3901-utf8.txt &&
+	 echo "UTF-8" >> file &&
+	 git add file &&
+	 git commit -F ../t3900/1-UTF-8.txt &&
+	 gitweb_run "p=.git;a=commit"'
+test_debug 'cat gitweb.log'
+
+test_expect_success \
+	'encode(commit): iso-8859-1' \
+	'. ../t3901-8859-1.txt &&
+	 echo "ISO-8859-1" >> file &&
+	 git add file &&
+	 git config i18n.commitencoding ISO-8859-1 &&
+	 git commit -F ../t3900/ISO-8859-1.txt &&
+	 git config --unset i18n.commitencoding &&
+	 gitweb_run "p=.git;a=commit"'
+test_debug 'cat gitweb.log'
+
+test_expect_success \
+	'encode(log): utf-8 and iso-8859-1' \
+	'gitweb_run "p=.git;a=log"'
+test_debug 'cat gitweb.log'
+
 test_done
-- 
1.5.2

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux