> From: John Stracke <jstracke@centivinc.com>

> Perry E. Metzger wrote:
> ...
> >The problem is that, naturally, the spammers will start running the
> >tool over their spam before sending it and tweaking it until it
> >passes. This has happened for previous techniques.
> >
> That would be somewhat less useful in this case, though, since each user
> has their own table of keywords.

That contradicts other assumptions about this mechanism, and it points
out a major problem.

One assumption is that spam is a more rather than less uniformly
distributed flood. If it is not uniform, how can you hope that the
statistical characteristics of previous samples will be related to
future samples from new spammers? If spam is uniform, then why do users
need private tables of keywords? Spam distribution does have
non-uniformities for language and character set, but as the Russian,
Spanish, and Asian spam that English speakers receive shows, spammers
try very hard to make spam distribution uniform.

(As I've said, based on DCC counts and other data, I figure there are
fewer than 2000 whack-a-mole spammers on the net at any time, and there
is an average 100% turnover every 6 months, with even the worst
whack-a-mole spammers giving up after a few years. "Whack-a-mole" is
the recognized technical term for the spammers that flog Viagra, porn,
and loansharking, as opposed to the big outfits like American Express
and Dell Computers that never give up their "push advertising"
campaigns.)

The major problem is that the mechanism requires a significant and
continuing false-negative rate to keep the scoring tuned as spammers
come and go. For example, notice that the proposal talks about
filtering based on "words" that are in fact domain names. A mail
message today containing "cyberpromo" is almost certainly not spam, but
almost certainly was spam not very long ago. What if the long
established and very well known (except evidently to Mr. Graham) spam
haus 263.com also reforms itself?

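To make the stale-keyword concern concrete, here is a minimal sketch of
a per-user keyword table. It is not Mr. Graham's actual algorithm: the
class name KeywordTable, the 0.01/0.99 clamp, the neutral 0.5 score for
unseen tokens, and the sample tokens are all assumptions made for
illustration. It only shows that a token's score moves when
hand-corrected messages are fed back into the table.

```python
from collections import Counter


class KeywordTable:
    """Hypothetical per-user keyword table (illustration only)."""

    def __init__(self):
        self.spam = Counter()  # token -> spam messages containing it
        self.ham = Counter()   # token -> legitimate messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        """Record one hand-classified message; each token counts once."""
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham.update(set(tokens))
            self.n_ham += 1

    def token_score(self, token):
        """Return a spam score in [0.01, 0.99]; 0.5 for unseen tokens."""
        s = self.spam[token] / max(self.n_spam, 1)
        h = self.ham[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # assumption: unseen tokens are neutral
        return min(0.99, max(0.01, s / (s + h)))


table = KeywordTable()
for _ in range(10):
    table.train(["cyberpromo", "free", "offer"], is_spam=True)
    table.train(["meeting", "agenda"], is_spam=False)

before = table.token_score("cyberpromo")

# The domain reforms: ten legitimate messages now mention it, and the
# user has to correct each misclassification by hand.
for _ in range(10):
    table.train(["cyberpromo", "invoice"], is_spam=False)

after = table.token_score("cyberpromo")
print(before, round(after, 3))
```

In this toy run the score for "cyberpromo" starts at the 0.99 ceiling
and, after ten corrected messages, has dropped only to about 0.67. The
table catches up only as fast as misclassified mail is noticed and fed
back to it.
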
You might answer that problem by having a centralized service run a lot
of spam traps and distribute keyword scores. Brightmail has long been
operating a system that does that, but I understand they distribute
regular expressions and checksums instead of keyword scores. I'm not
sure that difference is significant. Another issue is that spam traps
get a distinct, atypical sample of spam.

I used the word "scoring" instead of "probabilities," as others in this
thread have, because I cannot see a major operational difference
between the "probabilities" of this proposal and the "scoring" of
systems such as SpamAssassin. Recall that SpamAssassin also computes a
score based on the presence or absence of words, as well as other
features. Using a computer program to tune your SpamAssassin or similar
keyword scoring is not a bad idea, but it's not going to end spam as we
know it.

Of course, the main problem with any and every such system is that it
is looking for characteristics other than "unsolicited" and "bulk."


Vernon Schryver    vjs@rhyolite.com