On Wed, 25 Feb 2004 gnulinux@xxxxxxxxxxx wrote:

> i have ~98% accuracy thanks to bayesian filtering. i
> haven't calculated my false positive rate, but i get
> false positives. even *one* false positive is
> unacceptable. even if my filter accuracy was 99.99% i
> would still need to trawl my spam folder to check for
> false positives. and as the spam volume continues to
> grow trawling the spam folder takes more and more
> time. i need to stop false positives and digital
> signatures are one possible solution.

But they aren't. They won't stop false positives and they won't keep you from having to traverse your spam folder if your criterion is "no false positives".

As has been pointed out, it is effectively impossible to make a filter that passes only signal and no noise, only desired communications and never undesired (especially given that the final decision is essentially human and subjective and too complex for ANY filtering tool but your own mind). There is real math behind this, but it is really a common sense conclusion as well.

So if your criterion is "no false positives" (a goal that is up there with a 100% efficient heat engine, winning the Publishers Clearing House sweepstakes, and a free lunch :-), not only will you need to trawl your spam folder, you'll need to trawl it repeatedly, as you yourself probably make mistakes on what is spam and what isn't once in 10^4 or 10^5 times, if not more, especially on a rapid scan of hundreds of messages almost all of which are spam.

It has been pointed out several times now that unless you are willing to receive mail only from a small, closed group of individuals that all agree to use digital signatures and whose mail you whitelist while blacklisting EVERYTHING ELSE you are right back where you are right now. Since you don't blacklist everything else, it gets filtered (or not) and you have to go through the rejects in a final pass.
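To make the whitelist point concrete, here is a toy sketch of the routing decision. Everything in it is an assumption for illustration: the `Message` fields, the `WHITELIST` contents, the example addresses, and the `THRESHOLD` value are made up, and the signature check is assumed to have happened elsewhere. It is not any real mail filter's logic.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    signature_ok: bool   # assumed result of a signature check done elsewhere
    spam_score: float    # assumed content-filter score, higher = spammier

# Hypothetical whitelist and cutoff, chosen only for the example.
WHITELIST = {"alice@example.org"}
THRESHOLD = 5.0

def route(msg, blacklist_everything_else=False):
    """Return where a message ends up under a signature-based whitelist."""
    # Signed mail from a whitelisted sender skips the content filter.
    if msg.signature_ok and msg.sender in WHITELIST:
        return "inbox"
    # The closed-channel option: nothing else is ever seen.
    if blacklist_everything_else:
        return "discard"
    # Otherwise the mail is filtered exactly as it is today.
    return "spam-folder" if msg.spam_score >= THRESHOLD else "inbox"

messages = [
    Message("alice@example.org", True, 6.2),   # signed friend, spammy-looking text
    Message("jimmy@example.net", False, 5.4),  # unsigned stranger, real mail
    Message("pills@example.com", False, 9.8),  # unsigned spam
    Message("list@example.org", False, 1.1),   # unsigned list traffic
]

for m in messages:
    print(m.sender, "->", route(m))
```

In this sketch the signature rescues only the whitelisted sender. The unsigned stranger still lands in the spam folder, so the folder still has to be reviewed, and with `blacklist_everything_else=True` that message is simply lost.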
Who knows, one of those rejects might be your long lost cousin Jimmy, trying to get in touch with you but alas possessed of neither your phone number nor a digital key and the knowledge of how to use it, and hence without the means to communicate that key to you and get on your whitelist. It requires an out of band communication for him to be admitted to your channel, and if you reject everything not in your channel, Jimmy's out of luck.

Then there is the cruel fact that you aren't going to convince all the list managers in the world to digitally sign their list traffic. You can of course decide never to subscribe to a list that doesn't, but that throws a whole lot of baby out with the bath water, and again leaves you trawling rejects.

The only issue in controlling spam is whether one BOTHERS to trawl the rejects. This, in turn, is related to the level at which you reject spam -- the ratio between false negatives (spam that makes it to your regular spool) and false positives (real mail that makes it into your spam spool).

I set my spam filtering high enough that I don't check the rejects. I've spot-checked the rejects for quite some time, and any "real mail" that gets rejected is VERY likely to be something I will survive if I never get, and if it IS important and for any reason the sender is a friend they will almost certainly send me an out of band communication in my mostly OPEN filter channel saying "Hey, why didn't you respond to my note letting you know that you won the sweepstakes in Nigeria and are about to be presented with a check by the grandson of its prime minister? I put a dozen URLs to the website announcing the result into the message when I sent it from my laptop through the open WiFi net I happened to be passing by..."

Others set the filter lower, but quick-scan the rejects.
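The trade-off between false negatives and false positives as the rejection level moves can be seen in a small sketch. The scores and labels below are invented for illustration (they are not measurements from any real filter), and `count_errors` is a hypothetical helper, not part of any spam-filtering package.

```python
def count_errors(scored, threshold):
    """Count (false_negatives, false_positives) for a rejection threshold.

    scored is a list of (score, is_spam) pairs; mail scoring at or above
    the threshold goes to the spam spool, everything else is delivered.
    """
    # Spam that slips under the threshold and reaches the regular spool.
    false_negatives = sum(1 for score, is_spam in scored
                          if is_spam and score < threshold)
    # Real mail that scores high enough to land in the spam spool.
    false_positives = sum(1 for score, is_spam in scored
                          if not is_spam and score >= threshold)
    return false_negatives, false_positives

# Made-up (score, is_spam) pairs, for illustration only.
SCORED = [
    (0.3, False), (1.2, False), (2.8, False), (5.6, False),
    (4.1, True), (6.0, True), (7.5, True), (9.9, True), (12.4, True),
]

for threshold in (3.0, 5.0, 8.0):
    fn, fp = count_errors(SCORED, threshold)
    print(f"threshold {threshold}: {fn} spam delivered, {fp} real mail rejected")
```

With these invented numbers, raising the threshold from 3.0 to 8.0 takes the rejected real mail from one message to none, at the price of three spam messages reaching the regular spool. Where to sit on that curve is the choice being described here.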
Still others might set one breakpoint VERY high (and reject 70% of the spam out of hand with "no" false positives), set a second low enough that no spam at all makes it into their regular spool, and quick-scan the intermediate rejects. Just by examining the SA rating of spam that makes it through and the rating of regular mail that makes it through, it is pretty easy to see whether or not your active boundary is reasonable.

Digital signatures won't change this a bit. They may permit you to identify a certain class of true negatives (and keep them out of the false positive bin), although I at least am cynical about even that. They won't keep you from having a wide range of mail that you still have to filter and still have to review, if your criterion is "no false positives" and you want to remain globally accessible to mail from strangers, and the additional work required to implement them at all greatly exceeds the work you're doing now, or could be doing with a bit of rearrangement.

   rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@xxxxxxxxxxxx