Re: url_rewrite_program doesn't seem to work on squid 2.6 STABLE17

Martin Jacobson (Jake) wrote:
Hi,

I hope that someone on this group can give me some pointers. I have a Squid proxy running version 2.6.STABLE17; I recently upgraded from a very old version, 2.4 something. The proxy sits in front of a search appliance and all search requests go through the proxy. One of my requirements is that all search requests for cache:SOMEURL go to a URL rewrite program that compares the requested URL to a list of blacklisted URLs. These URLs are listed one per line in a text file; any line that starts with # or is blank is discarded by the url_rewrite_program. This Perl program seemed to work fine in the old version, but now it doesn't work at all.
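
(For reference, the 2.6 rewriter interface works roughly like this: squid writes one request per line to the helper's stdin, something like

   http://linsquid1o.myhost.com/cgi-bin/search?q=cache:www.badsite.com/ 10.0.0.1/- - GET -

and expects exactly one reply line per request: an empty line to leave the URL unchanged, a replacement URL, or "302:URL" to have squid answer with a redirect. The exact fields vary between versions, so treat the sample line above as a sketch.)
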
Here is the relevant portion of my Squid conf file:
-------------------------------------------------------------------------------
http_port 80 defaultsite=linsquid1o.myhost.com accel

url_rewrite_program /webroot/squid/imo/redir.pl
url_rewrite_children 10


cache_peer searchapp3o.myhost.com parent 80 0 no-query originserver name=searchapp proxy-only
cache_peer linsquid1o.myhost.com parent 9000 0 no-query originserver name=searchproxy proxy-only
acl bin urlpath_regex ^/cgi-bin/
cache_peer_access searchproxy allow bin
cache_peer_access searchapp deny bin

Here is the Perl program:
-------------------------------------------------------------------------------
#!/usr/bin/perl

use strict;
use warnings;

$| = 1;   # unbuffered output: squid expects one reply line per request, immediately

my $CACHE_DENIED_URL = "http://www.mysite.com/mypage/pageDenied.intel";
my $PATTERNS_FILE = "/webroot/squid/blocked.txt";
my $UPDATE_FREQ_SECONDS = 60;

my $last_update = 0;
my $last_modified = 0;
my $match_function = sub { 0 };   # match nothing until the patterns file is loaded

my ($url, $remote_host, $ident, $method, $urlgroup);
my $cache_url;

my @patterns;


while (<>) {
   chomp;
   ($url, $remote_host, $ident, $method, $urlgroup) = split;
   &update_patterns();

   $cache_url = &cache_url($url);
   if ($cache_url) {
      if (&$match_function($cache_url)) {
         $cache_url = &url_encode($cache_url);
         print "302:$CACHE_DENIED_URL?URL=$cache_url\n";   # redirect to the denied page
         next;
      }
   }
   print "\n";   # empty reply line tells squid to leave the URL unchanged
}

sub update_patterns {
   my $now = time();
   if ($now > $last_update + $UPDATE_FREQ_SECONDS) {
      $last_update = $now;   # remember when we last checked the file
      my @a = stat($PATTERNS_FILE);
      return unless @a;      # file missing or unreadable: keep the current patterns
      my $mtime = $a[9];
      if ($mtime != $last_modified) {
         @patterns = &get_patterns();
         $match_function = build_match_function(@patterns);
         $last_modified = $mtime;
      }
   }
}


sub get_patterns {
   my @p = ();
   my $p = "";
   open PATTERNS, "< $PATTERNS_FILE" or die "Unable to open patterns file. $!";
   while (<PATTERNS>) {
      chomp;
      if (!/^\s*#/ && !/^\s*$/) {    # disregard comments and empty lines.
         $p = $_;
         $p =~ s#\/#\\/#g;
         $p =~ s/^\s+//g;
         $p =~ s/\s+$//g;
         if (&is_valid_pattern($p)) {
            push(@p, $p);
         }
      }
   }
   close PATTERNS;
   return @p;
}

sub is_valid_pattern {
   my $pat = shift;
   return eval { "" =~ m|$pat|; 1 } || 0;
}


sub build_match_function {
   my @p = @_;
   my $expr = join(' || ', map { "\$_[0] =~ m/$p[$_]/io" } (0..$#p));
   my $mf = eval "sub { $expr }";
   die "Failed to build match function: $@" if $@;
   return $mf;
}

sub cache_url {
   my $url = shift;
   my ($script, $qs) = split(/\?/, $url, 2);
   if ($qs) {
      my ($name, $value);
      my @params = split(/&/, $qs);
      foreach my $param (@params) {
         ($name, $value) = split(/=/, $param, 2);
         next unless defined $value;   # parameter without a value
         $value =~ tr/+/ /;                                            # '+' decodes to space
         $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack("C", hex($1))/eg;    # %XX decoding
         if ($value =~ /cache:([A-Za-z0-9]{7,20}:)?([A-Za-z]+:\/\/)?([^ ]+)/) {
            if ($2) {
               return $2 . $3;
            } else {
               # return "http://" . $3;
               return $3;
            }
         }
      }
   }
   return "";
}

sub url_encode {
   my $str = shift;
   $str =~ tr/ /+/;
   $str =~ s/([\?&=:\/#])/sprintf("%%%02x", ord($1))/eg;
   return $str;
}

Below is a sample of the blocked URLs file
################################################################################
#
# URL Patterns to be Blocked
#---------------------------
# This file contains URL patterns which should be blocked
# in requests to the Google cache.
#
# The URL patterns should be entered one per line.
# Blank lines and lines that begin with a hash mark (#)
# are ignored.
#
# Anything that will work inside a Perl regular expression
# should work.
#
# Examples:
# http://www.bad.host/bad_directory/
# ^ftp:
# bad_file.html$
################################################################################
# Enter URLs below this line
################################################################################


www.badsite.com/


So my question: is there a better way of doing this?


You would be much better off defining this as an external_acl program, and possibly using deny_info to do the 'redirect' when it blocks a request. That way the ACL-lookup results can also be cached in Squid, reducing the server load of doing URL rewrites.
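
A minimal sketch of that approach (the helper path, ACL name and ttl are only placeholders, and you should confirm in your squid.conf.default that your version supports the %URI format token):

external_acl_type blocklist ttl=300 children=5 %URI /webroot/squid/imo/check_url.pl
acl blocked_cache external blocklist
http_access deny blocked_cache
deny_info http://www.mysite.com/mypage/pageDenied.intel blocked_cache

The helper speaks a simpler protocol than a rewriter: it reads one URL per line on stdin and answers "OK" (the ACL matches, so the deny fires and deny_info sends the client to the denied page) or "ERR" per line, e.g.:

#!/usr/bin/perl
# Hypothetical external ACL helper; &matches_blacklist would reuse the
# pattern-loading and matching code from redir.pl above.
$| = 1;
while (<STDIN>) {
   chomp;
   print &matches_blacklist($_) ? "OK\n" : "ERR\n";
}
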

Does anyone see anything wrong that is keeping this from working in 2.6?

Your old way should still have worked though, inefficient as it was.
The squid.conf snippet you have shown appears to be correct; there may be something elsewhere unexpectedly affecting it, though.

Or the Perl program may simply be failing to access its data file properly (remember that squid drops its permissions down to a non-root user, who needs access to all of the program's resources).

A view of the error message should be helpful in tracking this down. It may be in syslog or in cache.log, or you can provoke it by running the Perl program from the command line as the squid effective user.
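
For example (a sketch; substitute whatever cache_effective_user is set to in your squid.conf for "squid"):

   su -s /bin/sh -c /webroot/squid/imo/redir.pl squid
   http://linsquid1o.myhost.com/cgi-bin/search?q=cache:www.badsite.com/ 10.0.0.1/- - GET -

Paste a request line like the second one by hand. A permissions problem on blocked.txt shows up immediately as the die() message instead of a reply, while a working helper prints either an empty line or a 302: redirect.
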

Amos
--
Please use Squid 2.7.STABLE3 or 3.0.STABLE7
