Re: [PATCH 2/n] gitweb: Use '&iquot;' instead of '?' in esc_path

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 3 Nov 2006 23:33:49 +0100

Junio C Hamano wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> writes:
> 
>> # quote unsafe characters and escape filename to HTML
>> sub esc_path {
>> 	my $str = shift;
>> 	$str = esc_html($str);
>> 	$str =~ s!([[:cntrl:]])!sprintf('<span class="cntrl">&#%04d;</span>', 9216+ord($1))!eg;
>> 	return $str;
>> }
>>
>> with perhaps the following CSS
>>
>> span.cntrl {
>> 	border: dashed #aaaaaa;
>> 	border-width: 1px;
>> 	padding: 0px 2px 0px 2px;
>> 	margin:  0px 2px 0px 2px;
>> }
>>
>> What do you think of it?
> 
> Probably "# quote unsafe characters" is not what it does yet (it
> just quotes controls currently and nothing else), but we have to
> start somewhere and I think this is a good start.

Well, control characters (at least some of them) are not valid
characters in UTF-8 HTML output; Mozilla in strict XHTML mode complains.
Currently, for example, esc_html escapes the FORM FEED (FF) and ESCAPE
(ESC) characters, because they happen to be present in the git.git
repository (in the COPYING file and in commit v1.4.2.1-g20a3847,
respectively).

As I see it, we can either replace unsafe characters (control
characters) by a single character a la --hide-control-chars, which
is the minimal solution, or we can quote unsafe characters somehow,
but if we do that we have to indicate that we quote. Git core and
ls enclose material which needs escaping in quotes; in gitweb
that is somewhat impractical, and besides we have more possibilities
for marking a fragment of text (for example a span element around
the representation of the escaped characters).

I have thought of the following escaping:

1. Hide control characters using '?' or another similar character,
   for example &middot;
2. Use "Unicode" quoting, i.e. replace control characters by their
   Unicode Printable Representation (PR), as shown above. Has the
   advantage that it is simple and in theory does not need marking
   that it is quoted; has the disadvantage that the browser must
   support this part of Unicode, and that those characters are hard
   to read at the default font size gitweb uses.
3. Use Character Escape Codes (CEC), i.e. alphabetic and octal
   backslash sequences like those used in C. Probably needs to escape
   backslash (the quoting character) too. Has the advantage of being
   widely understood in the POSIX world. Has the disadvantage of
   needing an escape sequence table/hash. Has the advantage that it
   works for all characters: those with no special escape sequence
   simply get an octal backslash sequence.
4. Control key Sequence (CS), like the one used in esc_html currently,
   replacing control characters by the key sequence that produces them,
   for example replacing LF with ^J, CR with ^M, FF with ^L, ESC with
   ^[, TAB with ^I. Has the advantage of being understood, I think, in
   the MS-DOS/MS Windows world. Has the advantage of being used in
   esc_html already. Has the advantage that some text editors use this
   representation. Has the disadvantage of needing a large key sequence
   table/hash. Has the disadvantage that less common control characters
   have cryptic control key sequences.
5. Percent encoding, also known as URL encoding. Use the %<hex>
   encoding used in URLs, taken for example from the core of the
   esc_url/esc_param subroutines. Simple, but does need marking that
   it is escaped. Has the disadvantage of being hardly readable.
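
To make the comparison concrete, here is a rough sketch of options 1, 3,
4 and 5 as standalone subroutines working on the raw, not yet
HTML-escaped string. The subroutine names are hypothetical and none of
them exist in gitweb. For the ASCII range the control key sequence can
be computed by XOR-ing the character code with 0x40, so this sketch of
option 4 does not use a table; whether that covers every character we
care about is an assumption.

```perl
use strict;
use warnings;

# All names below are hypothetical; this is a sketch, not gitweb code.

# 1. Hide: every control character becomes '?'.
sub quote_hide {
	my $str = shift;
	$str =~ s/[[:cntrl:]]/?/g;
	return $str;
}

# 3. Character Escape Codes: C-like named escapes, octal for the rest.
# ("\013" is VT; '\e' is a common extension rather than standard C.)
my %cec = (
	"\t" => '\t', "\n" => '\n', "\r" => '\r', "\f" => '\f',
	"\b" => '\b', "\a" => '\a', "\e" => '\e', "\013" => '\v',
	"\\" => '\\\\',	# the quoting character itself is doubled
);
sub quote_cec {
	my $str = shift;
	$str =~ s/([[:cntrl:]\\])/exists $cec{$1} ? $cec{$1} : sprintf('\\%03o', ord($1))/eg;
	return $str;
}

# 4. Control key sequence: ^@ .. ^_ for 0x00 .. 0x1f, and ^? for DEL.
sub quote_cs {
	my $str = shift;
	$str =~ s/([\x00-\x1f\x7f])/'^' . chr(ord($1) ^ 0x40)/eg;
	return $str;
}

# 5. Percent encoding; '%' itself is encoded to keep it unambiguous.
sub quote_pct {
	my $str = shift;
	$str =~ s/([[:cntrl:]%])/sprintf('%%%02X', ord($1))/eg;
	return $str;
}

my $name = "foo\tbar\e.txt";
print quote_hide($name), "\n";	# foo?bar?.txt
print quote_cec($name),  "\n";	# foo\tbar\e.txt
print quote_cs($name),   "\n";	# foo^Ibar^[.txt
print quote_pct($name),  "\n";	# foo%09bar%1B.txt
```

In gitweb the result would still have to go through esc_html, and
options 3 to 5 would still need some visual marking of the quoted part.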

Which solution do you think is best? Or perhaps some other solution,
like using Unicode Printable Representation or Character Escape Codes,
with the exception of LF which would be replaced by &para; (paragraph
sign), CR by &crarr; and TAB by either &thorn;, &#8614; or &rarr;.
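
A minimal sketch of that mixed variant, building on the esc_path
proposed above, could look as follows. It is self-contained, so it uses
a hypothetical stand-in (esc_html_min) instead of gitweb's esc_html, and
the names cntrl_span and esc_path_mixed are made up for illustration.
TAB uses &rarr; here, picked arbitrarily from the candidates. DEL is
mapped explicitly to U+2421 (SYMBOL FOR DELETE), because 0x2400 + 127
falls outside the Control Pictures block.

```perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's esc_html: HTML metacharacters only.
sub esc_html_min {
	my $str = shift;
	$str =~ s/&/&amp;/g;
	$str =~ s/</&lt;/g;
	$str =~ s/>/&gt;/g;
	$str =~ s/"/&quot;/g;
	return $str;
}

# Control characters that get a dedicated entity instead of a picture.
my %special = (
	"\n" => '&para;',
	"\r" => '&crarr;',
	"\t" => '&rarr;',
);

# Return the marked-up replacement for one control character.
sub cntrl_span {
	my $chr = shift;
	my $rep;
	if (exists $special{$chr}) {
		$rep = $special{$chr};
	} elsif (ord($chr) == 0x7f) {
		$rep = sprintf('&#%04d;', 0x2421);	# SYMBOL FOR DELETE
	} else {
		$rep = sprintf('&#%04d;', 0x2400 + ord($chr));
	}
	return qq(<span class="cntrl">$rep</span>);
}

# Escape HTML first, so that the span markup added afterwards survives.
sub esc_path_mixed {
	my $str = esc_html_min(shift);
	$str =~ s/([[:cntrl:]])/cntrl_span($1)/eg;
	return $str;
}

print esc_path_mixed("a\tb\e<c>"), "\n";
```

Keeping the exceptions in a hash makes it easy to try the other
candidates for TAB without touching the substitution itself.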

-- 
Jakub Narebski
Poland