On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
> On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
> > On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:
> >
> > > Optimizing 1000 x "www.foo.bar/<randomstuff>" into a _single_
> > > "www.foobar.com/(r(egex|and(om)?)|fuba[rz])" regex is nowhere near linear.
> > > Even if it's all random servers, there are only ~30 characters from which
> > > branches are created.
> >
> > Right.
> >
> > Would be interesting to see how 50K dstdomain compares to 50K host
> > patterns merged into a single dstdomain_regex pattern in terms of CPU
> > usage. Probably a little tweaking of Squid is needed to support such
> > large patterns, but that's trivial. (The squid.conf parser is limited to
> > 4096 characters per line, including folding.)
>
> Not sure what the splay code does in Squid, didn't have time to grab it.
> But a simple test with Perl:
>
> - Grepped some hostnames from www logs etc.
> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
> - Ran 100000 hostnames against it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)
>
> It's pretty powerful stuff.

Oops, I did it slightly wrong. Doing it correctly, using ^hostname$ instead
of a plain hostname in the regex, brings the run down to 1.2 seconds; that's
80000+ hosts/sec.
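
For anyone who wants to reproduce the test, something along these lines
should do it. This is a rough, untested sketch rather than the exact script:
hostnames.txt and queries.txt are placeholder file names, and it anchors the
assembled alternation as a whole, which for plain hostnames is equivalent to
adding ^hostname$ per entry.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Regexp::Assemble;
  use Time::HiRes qw(time);

  # Assemble all hostnames into one big regex
  my $ra = Regexp::Assemble->new;
  open my $fh, '<', 'hostnames.txt' or die "hostnames.txt: $!";
  while (my $host = <$fh>) {
      chomp $host;
      $ra->add(quotemeta $host);    # escape the dots etc. in the hostname
  }
  close $fh;

  # Anchor the assembled alternation as a whole (same effect as ^hostname$)
  my $pat = $ra->as_string;
  my $re  = qr/^(?:$pat)$/;

  # Read the lookup set and time the matching
  open my $q, '<', 'queries.txt' or die "queries.txt: $!";
  chomp(my @queries = <$q>);
  close $q;

  my $start   = time;
  my $hits    = grep { /$re/ } @queries;   # scalar context: count of matches
  my $elapsed = time - $start;

  printf "%d of %d matched, %.0f hosts/sec\n",
      $hits, scalar @queries, @queries / $elapsed;

The assembly step is the slow part (22 seconds above); once the qr// is
built, matching each hostname against it is what gives the 80000+ hosts/sec
figure.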