On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
> On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:
>
> > Optimizing 1000 x "www.foo.bar/<randomstuff>" into a _single_
> > "www.foobar.com/(r(egex|and(om)?)|fuba[rz])" regex is nowhere near linear.
> > Even if it's all random servers, there are only ~30 characters from which
> > branches are created from.
>
> Right.
>
> Would be interesting to see how 50K dstdomain compares to 50k host
> patterns merged into a single dstdomain_regex pattern in terms of CPU
> usage. Probably a little tweaking of Squid is needed to support such
> large patterns, but that's trivial. (squid.conf parser is limited to
> 4096 characters per line, including folding)

I'm not sure what the splay code in Squid does; I didn't have time to grab
it. But here is a simple test with Perl:

- Grepped some hostnames from www logs etc.
- Regexp::Assemble'd 50000 unique hostnames (= 560 kB regex, took 22 sec)
- Ran 100000 hostnames against it in 4 seconds (25000 hosts/sec on a
  2.8 GHz CPU)

It's pretty powerful stuff.
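
For anyone who wants to try the idea without Perl, the merging step can be roughly sketched in Python. This is not Regexp::Assemble and makes no claim to match its output or speed; it is a plain prefix-trie merge, and the names build_trie, trie_to_regex and assemble, as well as the sample hostnames, are made up for illustration.

```python
import re

def build_trie(words):
    """Build a character trie; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment that shares common prefixes."""
    if set(node) == {""}:
        return ""                      # a word ends here, nothing follows
    optional = "" in node              # a word ends here AND longer ones continue
    alts = [re.escape(ch) + trie_to_regex(node[ch])
            for ch in sorted(k for k in node if k)]
    if len(alts) == 1 and not optional:
        return alts[0]                 # single branch: no group needed
    body = "(?:" + "|".join(alts) + ")"
    return body + "?" if optional else body

def assemble(hosts):
    """Compile one regex that matches exactly the given hostnames."""
    return re.compile(trie_to_regex(build_trie(hosts)))

# Hypothetical sample data, not the hostnames used in the test above.
hosts = ["www.example.com", "www.example.org",
         "mail.example.com", "example.com"]
pattern = assemble(hosts)

for candidate in ["www.example.com", "www.example.net"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Only common prefixes are merged here; Regexp::Assemble does more than that, so the resulting pattern from this sketch will generally be larger. Whether Python's re module copes well with a pattern built from 50000 hosts is something I have not measured.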