Re: [PATCH] gitweb: use highlight's shebang detection

On Tue, Sep 20, 2016, at 01:22 PM, Jakub Narębski wrote:
> On 06.09.2016 at 21:00, Ian Kelling wrote:
>
> > The highlight binary can detect language by shebang when we can't tell
> > the syntax type by the name of the file.
>
> Was it something always present among highlight[1] binary capabilities,
> or is it something present only in new enough highlight app?  Or only
> in some specific fork / specific binary?  I couldn't find language
> detection in highlight[1] documentation...
>
> [1]: http://www.andre-simon.de/doku/highlight/en/highlight.php

Search that page for the word "shebang"; it is mentioned twice.

>
> If this feature is available only for some version, or for some
> highlighters, gitweb would have to provide an option to configure
> it.  It might be an additional configuration variable, it might
> be a special value in the %highlight_basename or %highlight_ext.

Good question. It was added upstream in 2007, and I tested that it
works on the oldest distros I have easy access to: Ubuntu 14.04 and
Debian wheezy.

>
> >                                          To use highlight's shebang
> > detection, add highlight to the pipeline whenever highlight is enabled.
>
> This describes what this patch does, but the sentence feels
> a bit convoluted, as it is stated.
>

Agreed. I've changed it in v2 of the patch, and perhaps this will make
the rest of the patch clearer too. The new paragraph is:

    The highlight binary can detect language by shebang when we can't
    tell the syntax type by the name of the file. In that case, pass the
    blob to "highlight --force" and the resulting html will have markup
    for highlighting if the language was detected.
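
For anyone who wants to try this outside gitweb, here is a minimal sketch of
the same kind of invocation. It assumes a highlight binary on PATH; the
temporary file and its contents are only illustrative, and the flags are the
ones the patch passes.

```shell
# Sketch only: feed an extensionless script to highlight the way the
# patched gitweb pipeline does (content on stdin, no --syntax given).
tmpfile=$(mktemp)
printf '%s\n' '#!/bin/sh' 'echo hello' > "$tmpfile"

if command -v highlight >/dev/null 2>&1; then
	# --force asks highlight to produce output even if the syntax is unknown.
	highlight --replace-tabs=8 --fragment --force < "$tmpfile"
else
	echo "highlight not installed; skipping" >&2
fi
```

If the shebang is recognized, the output should contain highlighting markup;
otherwise it should be the same text as HTML without that markup.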



> >
> > Document the shebang detection and add a test which exercises it in
> > t/t9500-gitweb-standalone-no-errors.sh.
>
> Nice!
>
> >
> > Signed-off-by: Ian Kelling <ian@xxxxxxxxxxxxxx>
> > ---
> >
> > Notes:
> >     I wondered if adding highlight to the pipeline would make viewing a blob
> >     with no highlighting take longer but it did not on my computer. I found
> >     no noticeable impact on small files and strangely, on a 159k file, it
> >     took 7% less time averaged over several requests.
>
> Strange.  I would guess that invoking a separate binary and perl would
> always add to the time (especially on operating systems where forking /
> running a command is expensive... though those are not often used with
> web servers, are they).

I dug into this a little more, and I think it's because when we call
highlight, we later call sanitize() instead of esc_html(). sanitize() is
faster, which makes up for the extra time highlight takes. I ran a test
on my machine calling sanitize() and esc_html() on each line of
gitweb.perl 100 times: 7.4s for sanitize(), 12.4s for esc_html().

>
> >
> >  Documentation/gitweb.conf.txt          | 21 ++++++++++++++-------
> >  gitweb/gitweb.perl                     | 10 +++++-----
> >  t/t9500-gitweb-standalone-no-errors.sh | 18 +++++++++++++-----
> >  3 files changed, 32 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> > index a79e350..e632089 100644
> > --- a/Documentation/gitweb.conf.txt
> > +++ b/Documentation/gitweb.conf.txt
> > @@ -246,13 +246,20 @@ $highlight_bin::
> >  	Note that 'highlight' feature must be set for gitweb to actually
> >  	use syntax highlighting.
> >  +
> > -*NOTE*: if you want to add support for new file type (supported by
> > -"highlight" but not used by gitweb), you need to modify `%highlight_ext`
> > -or `%highlight_basename`, depending on whether you detect type of file
> > -based on extension (for example "sh") or on its basename (for example
> > -"Makefile").  The keys of these hashes are extension and basename,
> > -respectively, and value for given key is name of syntax to be passed via
> > -`--syntax <syntax>` to highlighter.
> > +*NOTE*: for a file to be highlighted, its syntax type must be detected
> > +and that syntax must be supported by "highlight".  The default syntax
> > +detection is minimal, and there are many supported syntax types with no
> > +detection by default.  There are three options for adding syntax
> > +detection.  The first and second priority are `%highlight_basename` and
> > +`%highlight_ext`, which detect based on basename (the full filename, for
> > +example "Makefile") and extension (for example "sh").  The keys of these
> > +hashes are the basename and extension, respectively, and the value for a
> > +given key is the name of the syntax to be passed via `--syntax <syntax>`
> > +to "highlight".  The last priority is the "highlight" configuration of
> > +`Shebang` regular expressions to detect the language based on the first
> > +line in the file, (for example, matching the line "#!/bin/bash").  See
> > +the highlight documentation and the default config at
> > +/etc/highlight/filetypes.conf for more details.
>
> All right; in addition to expanding the docs, it also improves them.

Noted in v2 commit log.

>
> >  +
> >  For example if repositories you are hosting use "phtml" extension for
> >  PHP files, and you want to have correct syntax-highlighting for those
> > diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> > index 33d701d..a672181 100755
> > --- a/gitweb/gitweb.perl
> > +++ b/gitweb/gitweb.perl
> > @@ -3931,15 +3931,16 @@ sub guess_file_syntax {
> >  # or return original FD if no highlighting
> >  sub run_highlighter {
> >  	my ($fd, $highlight, $syntax) = @_;
> > -	return $fd unless ($highlight && defined $syntax);
> > +	return $fd unless ($highlight);
>
> Here we would have check if we want / can invoke "highlight".

I think it's right as is. $highlight says the user wants highlighting,
and now we still want to invoke it when we do not know $syntax.
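
In shell terms, the choice the patch makes can be sketched roughly as follows.
This is not the gitweb code: highlight_syntax_arg is a hypothetical helper,
and an empty string stands in for Perl's undefined $syntax.

```shell
# Hypothetical helper mirroring the patch's $syntax_arg choice.
# $1: syntax name guessed from the file name, or empty if unknown.
highlight_syntax_arg () {
	if [ -n "$1" ]; then
		# Syntax known from basename or extension: pass it explicitly.
		printf '%s\n' "--syntax $1"
	else
		# Syntax unknown: leave detection to highlight itself.
		printf '%s\n' "--force"
	fi
}

highlight_syntax_arg sh   # prints: --syntax sh
highlight_syntax_arg ""   # prints: --force
```

Either way highlight is run, so the caller only needs to know whether the
highlight feature is enabled, not whether a syntax was guessed.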

While I was double checking, I noticed an unused parameter to
guess_file_syntax(), $mimetype, which could easily obscure this. I
removed it in v2.

>
> >
> >  	close $fd;
> > +	my $syntax_arg = (defined $syntax) ? "--syntax $syntax" : "--force";
> >  	open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
> >  	          quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse',
> >  	            '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);',
> >  	            '--', "-fe=$fallback_encoding")." | ".
> >  	          quote_command($highlight_bin).
> > -	          " --replace-tabs=8 --fragment --syntax $syntax |"
> > +	          " --replace-tabs=8 --fragment $syntax_arg |"
> >  		or die_error(500, "Couldn't open file or run syntax highlighter");
> >  	return $fd;
> >  }
>
> All right (well, except for the question asked at the beginning).
>
> > @@ -7063,8 +7064,7 @@ sub git_blob {
> >
> >  	my $highlight = gitweb_check_feature('highlight');
> >  	my $syntax = guess_file_syntax($highlight, $mimetype, $file_name);
> > -	$fd = run_highlighter($fd, $highlight, $syntax)
> > -		if $syntax;
>
> Hmmm... it looks like the old code checked if there was $syntax defined
> twice: once for truthy value in caller, once for definedness in
> run_highlighter().
>
> > +	$fd = run_highlighter($fd, $highlight, $syntax);
>
> All right.
>
> >
> >  	git_header_html(undef, $expires);
> >  	my $formats_nav = '';
> > @@ -7117,7 +7117,7 @@ sub git_blob {
> >  			$line = untabify($line);
> >  			printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
> >  			       $nr, esc_attr(href(-replay => 1)), $nr, $nr,
> > -			       $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
> > +			       $highlight ? sanitize($line) : esc_html($line, -nbsp=>1);
>
> Oh, well.  It looks like checking if highlighter could be run in
> run_highlight() is wrong, as the caller (that is, git_blob()) needs
> to know if it is using "highlight" output (which is HTML) or raw blob
> contents (which needs to be HTML-escaped).

Per my previous comment, run_highlighter() is right, and we use the same
condition here to know whether the highlight binary was used. If
highlight was run with --force and did not detect a language from the
shebang, it still outputs HTML, just without the highlight markup.
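
To illustrate why the same condition matters on the output side, here is a
rough shell sketch. The function names are made up, the escaping is a
simplified stand-in for gitweb's esc_html(), and the pass-through branch
leaves out the control-character handling that sanitize() does.

```shell
# Simplified stand-in for esc_html(): escape HTML metacharacters only.
esc_html_sketch () {
	printf '%s\n' "$1" |
	sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# $1: "yes" if the highlight feature is on; $2: one line read from the pipe.
render_line () {
	if [ "$1" = yes ]; then
		# highlight already produced HTML; escaping it again would
		# show the markup as literal text.
		printf '%s\n' "$2"
	else
		# Raw blob contents must be escaped before going into the page.
		esc_html_sketch "$2"
	fi
}

render_line no  'if (a < b && c > d)'
render_line yes '<span class="hl">if</span> (a &lt; b)'   # made-up markup
```

The point is that the branch depends on whether highlight ran at all, which
after this patch is the same as whether the feature is enabled.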

>
> >  		}
> >  	}
> >  	close $fd
> > diff --git a/t/t9500-gitweb-standalone-no-errors.sh b/t/t9500-gitweb-standalone-no-errors.sh
> > index e94b2f1..9e5fcfe 100755
> > --- a/t/t9500-gitweb-standalone-no-errors.sh
> > +++ b/t/t9500-gitweb-standalone-no-errors.sh
> > @@ -702,12 +702,20 @@ test_expect_success HIGHLIGHT \
> >  	 gitweb_run "p=.git;a=blob;f=file"'
> >
> >  test_expect_success HIGHLIGHT \
> > -	'syntax highlighting (highlighted, shell script)' \
> > +	'syntax highlighting (highlighted, shell script shebang)' \
>
> It would be nice to have in test name that it checks if highlighter
> autodetection works, or at least doesn't crash gitweb.

I've updated it to:

    syntax highlighting (highlighter language autodetection)

I'm happy to use any suggestion you have.

>
> >  	'git config gitweb.highlight yes &&
> > -	 echo "#!/usr/bin/sh" > test.sh &&
> > -	 git add test.sh &&
> > -	 git commit -m "Add test.sh" &&
> > -	 gitweb_run "p=.git;a=blob;f=test.sh"'
> > +	 echo "#!/usr/bin/sh" > test &&
> > +	 git add test &&
> > +	 git commit -m "Add test" &&
> > +	 gitweb_run "p=.git;a=blob;f=test"'
> > +
> > +test_expect_success HIGHLIGHT \
> > +	'syntax highlighting (highlighted, header file)' \
>
> Do we check explicit syntax knowledge (based on the extension),
> or autodetect again?

We have explicit syntax knowledge here. My thinking was to modify the
existing test so that it highlights a different language than the
autodetected one, but the patch is simpler if I just make the
autodetected one a different language. I've done that in v2.

>
> > +	'git config gitweb.highlight yes &&
> > +	 echo "#define ANSWER 42" > test.h &&
> > +	 git add test.h &&
> > +	 git commit -m "Add test.h" &&
> > +	 gitweb_run "p=.git;a=blob;f=test.h"'
> >
> >  # ----------------------------------------------------------------------
> >  # forks of projects
> >
>
> Thank you for your work on this patch,
> --
> Jakub Narębski

Thank you for reviewing it!



