Re: [PATCH 11/18] gitweb: add isDumbClient() check

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 09 Dec 2010 16:12:33 -0800 (PST)

"John 'Warthog9' Hawley" <warthog9@xxxxxxxxxxxxxx> writes:

> Basic check for the claimed Agent string, if it matches a known
> blacklist (wget and curl currently) don't display the 'Generating...'
> page.
> 
> Jakub has mentioned a couple of other possible ways to handle
> this, so if a better way comes along this should be used as a
> wrapper to any better way we can find to deal with this.
> 
> Signed-off-by: John 'Warthog9' Hawley <warthog9@xxxxxxxxxxxxxx>
> ---
>  gitweb/lib/cache.pl |   30 ++++++++++++++++++++++++++++++
>  1 files changed, 30 insertions(+), 0 deletions(-)
> 
> diff --git a/gitweb/lib/cache.pl b/gitweb/lib/cache.pl
> index d55b572..5182a94 100644
> --- a/gitweb/lib/cache.pl
> +++ b/gitweb/lib/cache.pl
> @@ -116,6 +116,34 @@ sub isFeedAction {
>  	return 0;		# False
>  }
>  
> +# There have been a number of requests that things like "dumb" clients, I.E. wget
> +# lynx, links, etc (things that just download, but don't parse the html) actually
> +# work without getting the wonkiness that is the "Generating..." page.
> +#
> +# There's only one good way to deal with this, and that's to read the browser User
> +# Agent string and do matching based on that.  This has a whole slew of error cases
> +# and mess, but there's no other way to determine if the "Generating..." page
> +# will break things.
> +#
> +# This assumes the client is not dumb, thus the default behavior is to return
> +# "false" (0) (and eventually the "Generating..." page).  If it is a dumb client
> +# return "true" (1)
> +sub isDumbClient {

Please don't use mixedCase, but underline_separated words,
e.g. browser_is_robot(), or client_is_dumb().

> +	my($user_agent) = $ENV{'HTTP_USER_AGENT'};

What if $ENV{'HTTP_USER_AGENT'} is unset / undef, e.g. because we are
runing gitweb as a script... which includes running gitweb tests?

> +	
> +	if(
> +		# wget case
> +		$user_agent =~ /^Wget/i
> +		||
> +		# curl should be excluded I think, probably better safe than sorry
> +		$user_agent =~ /^curl/i
> +	  ){
> +		return 1;	# True
> +	}
> +
> +	return 0;
> +}

Compare (note: handcrafted solution is to whitelist, not blacklist):

+sub browser_is_robot {
+       return 1 if !exists $ENV{'HTTP_USER_AGENT'}; # gitweb run as script
+       if (eval { require HTTP::BrowserDetect; }) {
+               my $browser = HTTP::BrowserDetect->new();
+               return $browser->robot();
+       }
+       # fallback on detecting known web browsers
+       return 0 if ($ENV{'HTTP_USER_AGENT'} =~ /\b(?:Mozilla|Opera|Safari|IE)\b/);
+       # be conservative; if not sure, assume non-interactive
+       return 1;
+}

from

  "[PATCHv6 17/24] gitweb: Show appropriate "Generating..." page when regenerating cache"
  http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163040
  http://repo.or.cz/w/git/jnareb-git.git/commitdiff/48679f7985ccda16dc54fda97790841bab4a0ba2

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html