> From: John Stracke <jstracke@centivinc.com>

> >>That would be less somewhat useful in this case, though, since each user
> >>has their own table of keywords.
>
> >That contradicts other assumptions about this mechanism
> >
> Whose? The author of the original article was very explicit that he was
> advocating users have individual tables.

Please read what I wrote instead of what you wish I had written.

> ...

> >be related to future samples from new spammers? If spam is uniform,
> >then why do users need private tables of keywords?
> >
> I think it was to reduce false positives--because the profile of
> different users' legitimate mail is nonuniform.

I think the purpose is to reduce false negatives at least as much as to
reduce false positives. As far as I can tell, the nature of the system
makes it difficult at best to adjust per-user tables to reduce false
positives. (The standard usage is that a "false positive" is rejected
legitimate mail while a "false negative" is spam that leaks past a filter.)

I've assumed that if implemented in production, the system would use a
corpus of manually identified spam and would automatically recompute the
scoring using only the samples from the last year or so.

> >The major problem is that the mechanism requires a significant and
> >continuing false-negative rate to keep the scoring tuned as spammers
> >come and go.
> >
> I dunno; keeping the tuning up to date sounds like a strength to me. It
> requires some level of effort, but a much lower level than deleting
> every piece of spam by hand.

That's a straw man. Of course it's good to keep your filters up to date,
but there are many other tactics that require less work of individuals
and produce fewer false positives than this scheme. The reason the spam
problem exists is that more than 99.99% of users cannot be bothered to
report spam to ISPs. This scheme requires false negatives, probably at
least 5% or 10%.
That's a lot of spam users would have to read compared to some other
tactics.

> >Of course, the main problem with any and every such system is that it
> >is looking for characteristics other than "unsolicited" and "bulk."
> >
> Yes, and the main problem with the DCC is that it does not.
>
> When I moved last fall, I went through old mail, harvested the addresses
> of old friends, and sent out mail with my new address. Some of these
> people had never received email from me (they and I were CC:ed on the
> same messages from other friends), so I would not have been on their
> whitelist. I don't know how many people I sent to, but it was certainly
> more than 10--which you say counts as bulk. So, if at least 10 of those
> people had been using the DCC, then my message would have been tagged as
> UBE, and some of them would not have gotten it. I suppose one might
> argue this message was bulk email, but I knew every one of those people
> personally, considered them friends (even if I hadn't seen them since
> college), and had reason to believe that they would be at least somewhat
> pleased to keep track of me. Why should that be filtered?

If those friends had sent mail to you, you might well be on their
whitelists. If they had never sent mail to you, and you had never sent
mail to them, then why would you presume to clutter their mailboxes with
news of your move?

One good reason to filter such mail is that, contrary to your hope, it
would have been viewed as "spam" or at least useless by many recipients
in similar situations, albeit not necessarily those friends of yours.
Most of us receive more than enough "new address" mail from people we
don't know very well and have never sent mail to.

Another reason to filter such mail is that in general it is useless noise
even from the point of view of its sender, if the sender sets aside the
normal human perspective of being the unique center of the universe.
It is useless noise because, unless you send only very little mail, you
cannot hope to reach more than a small fraction of your correspondents
with your change of address notice. The only workable tactics for dealing
with moving are to make new friends, hope your old friends can track you
down, or get a permanent address.

Such change of address mail is usually motivated by the same human
frailty that causes spam. Everyone thinks that spam is something that
other people do.

A better way to summarize the main problem with the DCC is that it
requires the use of per-user or at least per-enterprise whitelists.
It is not a small problem.

> I'm not advocating the Bayesian approach as a silver bullet, mind you;
> but I think it's an interesting area to look into. Even if it doesn't
> work, the general idea of filtering based on personalized statistics
> could lead to something that works better.

We agree about that. I'm irritated by the hype this notion has received,
such as the use of the phrase "Bayesian approach" to imply it is a
revolutionary invention. That phrase has some relevance to tuning the
scoring, but more as a formal description of, and a way of computerizing,
what people have been doing informally and manually for years.


Vernon Schryver    vjs@rhyolite.com