Junio C Hamano wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> writes:
>
>> # quote unsafe characters and escape filename to HTML
>> sub esc_path {
>> 	my $str = shift;
>> 	$str = esc_html($str);
>> 	$str =~ s!([[:cntrl:]])!sprintf('<span class="cntrl">&#%04d;</span>', 9216+ord($1))!eg;
>> 	return $str;
>> }
>>
>> with perhaps the following CSS
>>
>> span.cntrl {
>> 	border: dashed #aaaaaa;
>> 	border-width: 1px;
>> 	padding: 0px 2px 0px 2px;
>> 	margin: 0px 2px 0px 2px;
>> }
>>
>> What do you think of it?
>
> Probably "# quote unsafe characters" is not what it does yet (it
> just quotes controls currently and nothing else), but we have to
> start somewhere and I think this is a good start.

Well, control characters (at least some of them) are not valid
characters in UTF-8 HTML output; Mozilla in strict XHTML mode complains.
Currently, for example, esc_html escapes the FORM FEED (FF) and ESCAPE
(ESC) characters, because they happened to be present in the git.git
repository (in the COPYING file and in commit v1.4.2.1-g20a3847
respectively).

As I see it, we can either replace unsafe characters (control
characters) by a single character a la --hide-control-chars, which is
the minimal solution, or we can quote unsafe characters in some way; but
if we do that, we have to indicate that we quote. Git core and ls
enclose material which needs escaping in quotes; in gitweb that is
somewhat impractical, and besides we have more ways to mark a fragment
of text (for example a span element enclosing the representation of the
escaped characters).

I have thought of the following ways of escaping:

1. Hide control characters using '?' or another similar character,
   like ċ for example.

2. Use "Unicode" quoting, i.e. replace control characters by their
   Unicode Printable Representation (PR), as shown above.
   Has the advantage that it is simple and in theory does not need
   any marking that the text is quoted; has the disadvantage that the
   browser must support this part of Unicode, and that those
   characters are barely readable at the default font size gitweb
   uses.

3. Use Character Escape Codes (CEC), i.e. alphabetic and octal
   backslash sequences like those used in C. We would probably need to
   escape the backslash (the quoting character) too. Has the advantage
   of being widely understood in the POSIX world. Has the advantage
   that it works for all characters: those with no special escape
   sequence get a simple octal backslash sequence. Has the
   disadvantage of needing an escape sequence table/hash.

4. Use Control key Sequences (CS), like the ones used in esc_html
   currently, replacing control characters by the key sequence that
   produces them, for example LF with ^J, CR with ^M, FF with ^L, ESC
   with ^[, TAB with ^I. Has the advantage of being understood, I
   think, in the MS-DOS/MS Windows world. Has the advantage of already
   being used in esc_html. Has the advantage that some text editors
   use this representation. Has the disadvantage of needing a large
   key sequence table/hash. Has the disadvantage that less common
   control characters have cryptic control key sequences.

5. Use percent encoding, also known as URL encoding, i.e. the %<hex>
   encoding used in URLs, taken for example from the core of the
   esc_url/esc_param subroutines. Simple, but it does need marking
   that the text is escaped. Has the disadvantage of being hardly
   readable.

Which solution do you think is best? Or perhaps some other solution,
like using Unicode Printable Representation or Character Escape Codes,
with the exception of LF, which would be replaced by ¶ (paragraph
sign), RET by ↵ and TAB by either þ, ↦ or →.

--
Jakub Narebski
Poland
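
For illustration, options 3 (CEC) and 4 (CS) from the list above could
be sketched roughly as follows. This is only a sketch and not gitweb
code: the names quote_cec, quote_cs and %cec_map are made up for the
example, and the HTML side (calling esc_html first, wrapping the result
in a span) is left out. Note also that quote_cs computes the caret
notation by XOR-ing the character code with 64 instead of looking it up
in a table, which is an assumption of this sketch rather than what
esc_html does.

```perl
use strict;
use warnings;

# Hypothetical escape table for option 3 (CEC); not taken from gitweb.
my %cec_map = (
	"\t"   => '\t',
	"\n"   => '\n',
	"\r"   => '\r',
	"\f"   => '\f',
	"\b"   => '\b',
	"\a"   => '\a',
	"\e"   => '\e',
	"\013" => '\v',
	"\\"   => '\\\\',   # the quoting character itself must be escaped
);

# Option 3: C-style backslash sequences, with an octal fallback for
# control characters that have no entry in the table.
sub quote_cec {
	my $str = shift;
	$str =~ s{([[:cntrl:]\\])}{
		exists $cec_map{$1} ? $cec_map{$1} : sprintf('\\%03o', ord($1))
	}eg;
	return $str;
}

# Option 4: control key sequences (caret notation). XOR with 64 maps
# 0x00..0x1f to '@', 'A'..'Z', '[', ... and 0x7f (DEL) to '?'.
sub quote_cs {
	my $str = shift;
	$str =~ s{([\x00-\x1f\x7f])}{ '^' . chr(ord($1) ^ 64) }eg;
	return $str;
}

print quote_cec("a\tb\n"), "\n";   # prints: a\tb\n
print quote_cs("a\tb\n"), "\n";    # prints: a^Ib^J
```

In gitweb either subroutine would still have to be combined with
esc_html and some visible marking, such as the span.cntrl element from
the quoted esc_path, so that the reader can tell quoted text from
literal text.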