On Saturday 30 September 2006 05:11, Chuck Kollars wrote:
> Our experience with web filtering is the differences
> in tools are _completely_ swamped by the quality and
> depth of the blacklists. (The reverse of course is
> also true: lack of good blacklists will doom _any_
> filtering tool.)
>
> We currently have over 500,000 (!) sites listed in
> just the porn section of our blacklist. With quality
> lists like these, any old tool will do a decent job.

Large portions of those half a million sites are probably no longer
porn sites, or the domains have been given up entirely. I wouldn't
judge the quality purely by the quantity.

> Lots of folks need to get such lists reasonably and
> regularly (quarterly?).

Daily, even.

> Useful lists are far far too
> large to be maintained by local staff. Probably what's
> needed is a mechanism whereby everybody nationwide
> contributes, some central site consolidates and
> sanitizes, and then publishes the lists.

I'd welcome such an effort. Some companies invest a lot of effort in
URL categorisation - not just for porn sites - but they have several
employees working full-time on it and run a kind of editorial office.
A free/open-source project would need a lot of people plus some
mechanism (e.g. a web spider) that searches for further sites. And
that job is boring, so compared to other free/open-source projects
there is much less motivation to contribute constantly.

> This would be a huge effort. It's not easily possible
> even with lots of clever scripts and plenty of compute
> power. We've already seen more than a handful of
> "volunteers" swallowed up by similar efforts.

I believe the only blacklist that has survived over the years is
http://urlblacklist.com/ - except that it is non-free now. I may be
mistaken about its history, though.

There already are DNS-based blacklists that are very effective for
mail spam detection. Perhaps a DNS-based register where you can look
up whether a certain domain belongs to a certain category would help.
Large installations like ISPs could mirror the DNS zone, and private
people could simply query it. Perhaps even the Squid developers could
support such a blacklist. (A rough sketch of what a lookup could look
like is below my signature.)

So IMHO we lack both a source (volunteers, a spider, a web-based
contribution system) and a good way to use it. Huge static ACLs don't
work well with Squid (see the example config at the end of this
mail).

Since I had to tell our managers at work how well URL filtering works
(we use a commercial solution), I pulled some numbers: around 3,000
domains are registered at DeNIC (the German domain registry) alone
every day. Now add the other registries and you get a rough idea of
how many domains need to be categorized every day. That's the reason
why it's so hard to create reasonable blacklists. (And also the cause
of my rants when people expect decent filtering just by using the
currently available public blacklists.)

You didn't tell us much about your intentions, though. :)

Kindly
Christoph
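
P.S.: To make the DNS idea a bit more concrete, here is a minimal
sketch of how a client-side lookup could work, modelled on the
existing mail DNSBLs. The zone name "category.bl.example" and the
return-code-to-category mapping are made up for illustration; a real
register would have to define both.

# Sketch of a DNS-based category lookup (zone and codes are invented).
import socket

# Hypothetical mapping: last octet of the returned 127.0.0.x address
# encodes the content category.
CATEGORIES = {2: "porn", 3: "warez", 4: "gambling"}

def lookup_category(domain, zone="category.bl.example"):
    """Return the category for `domain`, or None if it is not listed."""
    try:
        answer = socket.gethostbyname("%s.%s" % (domain, zone))
    except socket.gaierror:
        return None            # NXDOMAIN -> domain is not listed
    last_octet = int(answer.split(".")[-1])
    return CATEGORIES.get(last_octet, "unknown")

print(lookup_category("example.com"))

The nice part is that the negative case (domain not listed) is just an
NXDOMAIN answer, so ordinary DNS resolvers cache it and the load on
the central register stays low.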
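
P.P.S.: For comparison, this is roughly what the static approach
looks like in squid.conf today - a dstdomain ACL loaded from a flat
file (the path and ACL name are only examples):

# squid.conf excerpt - one domain per line in the referenced file
acl porn dstdomain "/etc/squid/blacklists/porn.domains"
http_access deny porn

Squid reads the whole file into memory at startup, so with hundreds
of thousands of entries reconfiguring gets slow and every list update
means a reload - which is what I meant by huge static ACLs not
working well.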