Re: Access control : How to block a very large number of domains

Amos Jeffries <squid3@xxxxxxxxxxxxx> · Sat, 27 Jun 2009 16:07:40 +1200

hims92 wrote:
hello,
I performed the tests (to block sites using squidGuard) with fewer
domains, but Squid did not respond properly; that is, the network got slow.

squid-2.5.STABLE11.tar
squidGuard-1.2.10.tar
Berkeley DB 4.2.52

number of domains in black list - 656490 (0.6 million) ; urls - 141581 (0.1
million)
Peak time requests - 200/sec

Squid 2.5 is rather old now. Even 2.6 is now obsolete.

Sounds like the dnsserver traffic cap being reached. That was solved in
later versions by adding, and then improving, an internal DNS resolver.

Amos Jeffries-2 wrote:
On Mon, 15 Jun 2009 12:26:16 -0700 (PDT), hims92
<himanshu.singh.cse07@xxxxxxxxxxx> wrote:
Hi,
As far as I know, SquidGuard uses Berkeley DB (which is based on BTree and
Hash tables) for storing the urls and domains to be blocked. But I need to
store a huge number of domains (about 7 million) which are to be blocked.
Moreover, the search time to check if the domain is in the block list
has to be less than a microsecond.

So, will Berkeley DB serve the purpose?

I can search for a domain using a PATRICIA Trie in less than 0.1
microseconds. So, if Berkeley DB is not good enough, how can I use the
Patricia Trie instead of Berkeley DB in Squid to block the url?
To do tests with such critical timing you would be best to use an
internal ACL, which eliminates the network transfer delays to an external
process.

Can you be a bit more specific about how to do that? I am pretty new to squid.

Are you fixed to a certain version of Squid?

No, I am not. But presently, my institution has:
squid-2.5.STABLE11.tar
squidGuard-1.2.10.tar
Berkeley DB 4.2.52

And I would like to find a solution, if possible, for these versions only.

Squid-2 is not bad to tweak, but it is not very easy to add an ACL to either.

The Squid-3 ACLs are fairly easy to implement, and it is easy to drop a new
one in. You can create your own version of dstdomain and have Squid do the
test. At present dstdomain uses an unbalanced splay tree on full
reverse-string matches, which is good but not as good as it could be for
large domain lists.
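
To illustrate what reverse-string matching on a tree means, here is a minimal standalone sketch. It is not Squid's dstdomain code: the names DomainBlocklist and reversed are hypothetical, and std::set (an ordered tree in common implementations) merely stands in for the splay tree. Reversing the names turns "is this host inside a listed domain" into a prefix test that only needs checking at label boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: "example.com" becomes "moc.elpmaxe".
std::string reversed(std::string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Hypothetical blocklist. std::set is an assumption standing in for
// Squid's splay tree; only the reversed-string idea is being shown.
class DomainBlocklist {
public:
    void add(const std::string& domain) {
        assert(!domain.empty());
        entries_.insert(reversed(domain));
    }

    // True if host equals a listed domain or is a subdomain of one.
    bool matches(const std::string& host) const {
        const std::string r = reversed(host);
        for (std::size_t i = 0; i <= r.size(); ++i) {
            // Only prefixes that end on a label boundary can match,
            // so "badexample.com" does not match "example.com".
            if (i == r.size() || r[i] == '.') {
                if (entries_.count(r.substr(0, i)) > 0) return true;
            }
        }
        return false;
    }

private:
    std::set<std::string> entries_;
};
```

Each lookup costs one tree search per label in the host name, so the cost grows with the depth of the name and the logarithm of the list size.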

How do we create our own version of dstdomain?
Do the earlier versions (2.x) of squid also use an unbalanced splay tree for
searching a url/domain, or do they use linear search, binary search or some
other efficient search technique?

Ah, that I'm not sure of. I only joined the squid project 3 years ago.
2.5 was way before my time. There were a lot of improvements during 2.6
when people's local 2.5 patches got merged, apparently.

I'm still learning stuff about 2.5 as one would hear tales of walking 
disk drives when I was a student :)

Is it possible to store all the domains and urls (0.7 million approx)
in a vector (STL) and then perform binary_search to find the result of the
query?

That has been tried and found slower than the existing splay methods. 
Too much overhead in the STL.

I tested the binary_search in a standalone C++ program, and the query time
was pretty satisfactory for me.
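
For reference, a standalone test of that kind might look like the sketch below. It is an assumption about the shape of such a program, not the poster's actual code; prepare and is_blocked are hypothetical names. Note that it only does exact matches, whereas dstdomain also has to match subdomains.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: sort and de-duplicate once, when the list is loaded.
void prepare(std::vector<std::string>& domains) {
    std::sort(domains.begin(), domains.end());
    domains.erase(std::unique(domains.begin(), domains.end()), domains.end());
}

// Exact-match lookup in O(log n) string comparisons.
// The vector must have been passed through prepare() first.
bool is_blocked(const std::vector<std::string>& domains, const std::string& host) {
    // Debug-only sanity check; it is O(n) and compiled out with NDEBUG.
    assert(std::is_sorted(domains.begin(), domains.end()));
    return std::binary_search(domains.begin(), domains.end(), host);
}
```

For about 0.7 million entries this is roughly 20 string comparisons per query. A fair timing test should be built with NDEBUG so the sanity check is not measured.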

How does squid handle the requests for domain IPs? Does it store all domain
IPs somewhere, or does it first perform a DNS lookup for the domain name and
then check whether it is in the deny/access list before giving access?

The access control lists are processed and converted to whatever native 
type they need at configure time (thus a domain name can be entered in 
'dst' type and will add all its current set of IPs to the dst list.)

During operation there is a fqdncache for rDNS results and an ipcache for
DNS results. When a domain needs converting it's looked up there first,
then a DNS request is started if it is not found or the TTL has expired.
Then when an IP is available it's checked against the ACL list.
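
The lookup order described above (cache first, DNS on a miss or expired TTL, then the ACL check on the resulting IPs) can be sketched roughly as follows. This is a simplified, synchronous illustration and not Squid's ipcache: IpCache, CacheEntry, Resolver and dst_denied are hypothetical names, the resolver is a caller-supplied callback, and a single fixed TTL is assumed instead of per-record TTLs.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// All names are hypothetical; this is not Squid's implementation.
struct CacheEntry {
    std::vector<std::string> addresses;
    long expires_at;  // seconds, on the same clock as `now`
};

using Resolver = std::function<std::vector<std::string>(const std::string&)>;

class IpCache {
public:
    IpCache(Resolver resolve, long ttl) : resolve_(std::move(resolve)), ttl_(ttl) {
        assert(ttl > 0);
    }

    // Return cached addresses if still fresh, otherwise resolve again.
    const std::vector<std::string>& lookup(const std::string& host, long now) {
        auto it = entries_.find(host);
        if (it != entries_.end() && it->second.expires_at > now) {
            return it->second.addresses;  // fresh cache hit
        }
        // Miss or expired TTL: start a (here, blocking) DNS request.
        CacheEntry& slot = entries_[host];
        slot = CacheEntry{resolve_(host), now + ttl_};
        return slot.addresses;
    }

private:
    Resolver resolve_;
    long ttl_;
    std::map<std::string, CacheEntry> entries_;
};

// Once IPs are available, check each one against the deny list.
bool dst_denied(IpCache& cache, const std::set<std::string>& denied_ips,
                const std::string& host, long now) {
    for (const std::string& ip : cache.lookup(host, now)) {
        if (denied_ips.count(ip) > 0) return true;
    }
    return false;
}
```

Passing the clock in as `now` keeps the sketch easy to test; a real proxy would resolve asynchronously rather than block the request path.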

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE6 or 3.0.STABLE16
  Current Beta Squid 3.1.0.9