Re: Searching squid logs for pornographic sites

> The approach is problematic, especially when 
> using "three letter word" combinations, which match 
> arbitrary, harmless URLs.

The dreaded "unintended match in the middle of a word"
problem can torpedo just about any approach; conversely,
it can be "fixed" in just about any approach. Solving it
is not a matter of changing approaches, but rather of
changing tools.
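For instance, a naive substring search along the lines
of

    grep -i 'sex' access.log

will cheerfully "find" hostnames like www.middlesex.edu
and sussex.ac.uk right alongside anything genuinely
objectionable. (The log file name and hostnames here are
hypothetical, purely to illustrate the mid-word match.)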

What's needed is a way to specify "word boundaries"
when doing regular expression matching. Unfortunately
the regular expression syntax for word boundaries
varies from tool to tool. Perl and its derivatives let
you specify \b at the beginning and/or end of a word
(or its opposite, \B, for not-a-word-boundary). Classic
`egrep` provides the same functionality but with a
different syntax: \< at the beginning of a word and
\> at the end of a word. GNU egrep, GNU awk, and GNU
Emacs support both syntaxes. Tcl provides word boundary
functionality with \m, \M, and \y. Both Java and .NET
are Perl-like. The "-F" command line switch turns GNU
`grep` into its even stupider cousin `fgrep`, which
does no regular expression matching at all, word
boundaries included.
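As a rough sketch of the same pattern in a couple of
those syntaxes (the word 'sex' and the file name
access.log are just placeholders):

    # Perl-style \b word boundaries
    perl -ne 'print if /\bsex\b/i' access.log

    # \<...\> style, assuming your egrep honors it
    # (GNU egrep does)
    egrep -i '\<sex\>' access.log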


GNU grep does however let you use Perl-style regular
expressions by specifying "-P" on the command line. And
perhaps most importantly, GNU grep (and GNU egrep,
which is the same program with different switches)
lets you quickly and automatically turn _everything_
in your regular expressions into full words with the
"-w" command line switch (lots of convenience, not
much control :-).
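A sketch of what that looks like in practice, assuming
a hypothetical badwords.txt with one pattern per line:

    # -w forces every pattern to match only as a whole word
    grep -i -w -f badwords.txt access.log

    # -P switches to Perl-style regexes, so \b works
    grep -i -P '\bsex\b' access.log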

In summary: If you want to specify word boundaries
inside the regular expressions, use Perl, GNU grep -P,
or some other fairly modern tool. If you want word
boundary functionality withOUT specifying word
boundaries in the regular expressions themselves, use
GNU grep -w. If you have no other choice, you can make
it work with classic egrep by inserting \< and \>
appropriately in your regular expressions. But classic
grep won't do word boundaries no matter what. (You can
sort of fake it, but it's a lot of effort and it
doesn't work in all cases.) Note in particular that the
easy-to-overlook "-w" command line switch on GNU grep
can make a night/day difference.
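If you're stuck with an existing word list and an egrep
that honors \< and \>, something along these lines (the
badwords.txt file is again hypothetical) wraps each
pattern in word boundaries before feeding it to egrep:

    # wrap every line of the word list in \< ... \>
    sed 's/.*/\\<&\\>/' badwords.txt > badwords-egrep.txt
    egrep -i -f badwords-egrep.txt access.log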

Please do let this list know your results after a few
months. (It sounds like I'm not the only one who's a
bit skeptical that the "bad words in URL" approach that
seemed to work reasonably a couple of years ago will
give even ballpark results these days...)

thanks!



-Chuck Kollars


      
