On Sat, Jun 23, 2007, Andreas Pettersson wrote:

> I would :)
> However Phishtank publishes a full xml file which with some tweaking
> could be converted into a plain text list of domains or urls for direct
> use with squid.
> http://www.phishtank.com/blog/2006/10/17/xml-data-file-of-online-valid-phishes-from-phishtank/

Yup; I'm working on loading that into a hash for lookups, after
normalising the URLs (removing the protocol, user@password and anchor;
the query can't be removed, as some phish URLs bounce via well-known
services like google, live.com, etc.)

> I'm not sure realtime lookups via the google or phishtank api could keep
> up with caches serving over 100 requests/sec.

The lookups have to be done locally, with server updates pushed out, a la
the Google safebrowsing hash updates. If I get this stuff done I'm hoping
the phishtank guys will release diffs to their XML database file.

My code won't be using the live APIs; it'll download the XML database
(phishtank) and hash database (google) locally and load them into an
external_acl helper.

> By the way, haven't a DNSBL for this purpose been discussed previously?

DNSBLs require one of two things:

* a crapload of infrastructure to service the DNSBL, as you're talking
  about caches which could issue thousands of requests a second; or
* local DNS zones which are then loaded into private DNS servers (like
  ordb used to do) so you can look up against those.

Adding an extra few hundred milliseconds per request is probably not
going to help the browsing experience.

Adrian
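For anyone wanting to experiment, here's a rough sketch (not Adrian's actual code) of the normalisation step he describes: drop the scheme, the user:password part and the anchor (fragment), but keep the host, path and query string, since phish URLs often bounce via well-known services and the query is where the payload lives.

```python
# Sketch of the normalisation described above -- assumptions, not the
# real implementation.  Drops scheme, userinfo and fragment; keeps
# host + path + query as the hash lookup key.
from urllib.parse import urlsplit

def normalise(url: str) -> str:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()   # .hostname strips user:password@
    key = host + parts.path
    if parts.query:
        key += "?" + parts.query            # query kept (redirector phishes)
    return key                              # fragment (anchor) discarded

# e.g. normalise("http://user:pw@Example.COM/login?next=http://evil/#top")
#      -> "example.com/login?next=http://evil/"
```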
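The external_acl helper side could look something like the sketch below. A Squid external_acl helper reads one lookup key per line on stdin and answers "OK" (matches the acl) or "ERR" per line; here I'm assuming the key is the normalised URL and that the downloaded phishtank data has already been flattened into a plain text file of normalised URLs (file path and format are my assumptions, not part of the mail).

```python
#!/usr/bin/env python3
# Sketch of an external_acl helper for the scheme described above.
# Assumes squid.conf passes the URL as the lookup key and that the
# blacklist file (path is hypothetical) holds one normalised URL per line.
import sys

def load_blacklist(path):
    """Load the plain-text phish list into a set for O(1) lookups."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main(path="/var/lib/squid/phishtank.txt"):  # hypothetical path
    phish = load_blacklist(path)
    for line in sys.stdin:
        url = line.strip()
        # "OK" = known phish (acl matches), "ERR" = not listed.
        sys.stdout.write("OK\n" if url in phish else "ERR\n")
        sys.stdout.flush()  # Squid expects one reply line per request

if __name__ == "__main__":
    main(*sys.argv[1:])
```

The set lookup keeps per-request cost trivial even at 100+ req/sec; the expensive part (fetching and parsing the XML) happens out-of-band when the file is rebuilt.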