Re: Re: I need help with url_regex


On 10/09/10 09:17, devlin7 wrote:

Thanks Amos for the feedback.

It must be that I am entering it incorrectly because anything with a * or ?
doesn't work at all.

Are you sure that the "." is treated as "any character"?

I am. In POSIX regex...
 "." means any (single) character.
 "*" means zero or more of the previous item.
 "+" means one or more of the previous item.
 "?" means zero or one of the previous item.
 "\" means treat the next character as literal, even if it's usually special.

By "item" above I mean one character or a whole bracketed () group.

To be matched as part of the source text the reserved characters all need to be escaped like \? in the pattern.
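
For example, here is the difference an escape makes (the ACL names are just illustrative):

  # unescaped "." matches ANY character, so this also hits www.sinfo.com
  acl looseInfo url_regex .info
  # escaped "\." matches a literal dot only
  acl literalInfo url_regex \.info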


I would have thought that blocking .info would block any site that had .info
in it like www.porn.info but from what you are saying it would also block
www.sinfo.com. Am I correct?

Yes. These accidental matches are most of the problem with this type of config.


So is there a better way?

Yes, a few. Breaking the denial into several rules makes the matching both faster and more precise.


In most cases you will find you can do away with the regex part entirely and ban a whole domain. This way you can also search online and download lists of proxy domains to block wholesale, which is far easier than trying to build the list yourself. SquidGuard, DansGuardian, and the ufdb tools provide lists like this, and RHSBL anti-spam lists often include open proxy domains.
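
A rough sketch of loading such a downloaded list into a dstdomain ACL (the file path is just an example):

  # one domain per line; a leading dot matches the domain and all its subdomains
  acl proxyDomains dstdomain "/etc/squid/proxy-domains.txt"
  http_access deny proxyDomains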


Some matches you can limit to certain domains, running the regex against only the path portion of the URL (urlpath_regex matches the path plus query string):

  # fast check: destination is example.com or any .info domain
  acl badDomains dstdomain .example.com .info
  # regex tested against the URL path + query string only
  acl urlPathRegex urlpath_regex ^/browse\.php \.php\?q= \.php\?u=i8v
  # deny only when BOTH ACLs match; the cheap dstdomain test runs first
  http_access deny badDomains urlPathRegex


There will be some patterns which detect certain types of broken CMS (usually the search component "\?q=" like I mentioned) acting like a proxy even if they were never intended that way. A urlpath_regex without the domain protection above is needed to catch the many sites using these CMS. Just be sure of, and careful with, the patterns.
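
Something like this (the patterns are only illustrative; tune them against what you actually see in your logs):

  # catch CMS search/redirect components being fed full URLs to fetch
  acl cmsProxyPaths urlpath_regex \?q=http \.php\?u=http
  http_access deny cmsProxyPaths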


NP: ordering your rules in the same order I've named them above will even provide some measure of speed gain for the proxy. dstdomain matching is rather fast; regex is slow and resource-hungry.
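
So, using the sketch ACL names from above, a rough ordering would be:

  # plain domain lookup: fast, eliminates most requests cheaply
  http_access deny proxyDomains
  # regex, but only tested on the listed domains
  http_access deny badDomains urlPathRegex
  # regex tested on every request: slowest, leave until last
  http_access deny cmsProxyPaths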


To back up all of this you need reliable management support behind the blocking policy, with stronger enforcement for students caught actively trying to evade it. Without those you are in the sad position of an endless race.

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.8
  Beta testers wanted for 3.2.0.2

