RE: digital signature request

"Robert G. Brown" <rgb@xxxxxxxxxxxx> · Thu, 26 Feb 2004 09:05:42 -0500 (EST)

On Wed, 25 Feb 2004 gnulinux@xxxxxxxxxxx wrote:

> i have ~98% accuracy thanks to bayesian filtering.  i 
> haven't calculated my false positive rate, but i get 
> false positives.  even *one* false positive is 
> unacceptable.  even if my filter accuracy was 99.99% i 
> would still need to trawl my spam folder to check for 
> false positives.  and as the spam volume continues to 
> grow trawling the spam folder takes more and more 
> time.  i need to stop false positives and digital 
> signatures are one possible solution.

But they aren't.  They won't stop false positives and they won't keep
you from having to traverse your spam folder if your criterion is "no
false positives".  

As has been pointed out, it is effectively impossible to make a filter
that passes only signal and no noise, only desired communications and
never undesired (especially given that the final decision is essentially
human and subjective, and too complex for ANY filtering tool but your own
mind).  There is real math behind this, but it is really a common sense
conclusion as well.  So if your criterion is "no false positives" (a
goal that is up there with a 100% efficient heat engine, winning the
Publishers Clearing House sweepstakes, and a free lunch :-), not only
will you need to trawl your spam folder, you'll need to trawl it
repeatedly, as you yourself probably misjudge what is spam and what
isn't once in every 10^4 or 10^5 messages, if not more, especially on a
rapid scan of hundreds of messages almost all of which are spam.
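
To put a rough number on the "trawl it repeatedly" point: if each manual
keep-or-delete decision goes wrong independently with probability p, the
chance of at least one mistake in n decisions is 1 - (1 - p)^n.  The
sketch below is only an illustration.  The function name and the
independence assumption are hypothetical, and the error rates are just
the 10^-4 and 10^-5 guessed at above.

```python
def prob_at_least_one_mistake(per_message_error: float, messages_scanned: int) -> float:
    """Chance of one or more misjudgments during a manual scan.

    Assumes (hypothetically) that every decision errs independently
    with the same probability.
    """
    return 1.0 - (1.0 - per_message_error) ** messages_scanned


if __name__ == "__main__":
    # Illustrative volumes only; none of these are measured figures.
    for p in (1e-4, 1e-5):
        for n in (1_000, 10_000, 100_000):
            risk = prob_at_least_one_mistake(p, n)
            print(f"p={p:g}  n={n:>7}  P(at least one miss)={risk:.3f}")
```

At one error per 10^4 decisions, scanning 10^4 messages already gives
roughly a 63% chance of at least one miss, so a single pass cannot
deliver "no false positives" either.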

It has been pointed out several times now that unless you are willing to
receive mail only from a small, closed group of individuals that all
agree to use digital signatures and whose mail you whitelist while
blacklisting EVERYTHING ELSE, you are right back where you are right now.
Since you don't blacklist everything else, it gets filtered (or not) and
you have to go through the rejects in a final pass.  Who knows, one of
them might be your long lost cousin Jimmy, trying to get in touch with
you but alas possessed of neither your phone number nor a digital key
(with the knowledge of how to use it), and hence lacking the means to
get that key to you and onto your whitelist.  It requires an out of band
communication for him to be admitted to your channel, and if you reject
everything not in your channel, Jimmy's out of luck.  
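
The routing logic being described can be sketched in a few lines.  This
is a minimal illustration, not anybody's actual mail setup: the function
name, the `closed_channel` flag, and the example addresses are all
hypothetical, and signature verification is stubbed out as a boolean
rather than calling any real crypto library.

```python
def route(sender: str, signature_valid: bool, whitelist: set,
          closed_channel: bool) -> str:
    """Decide where a message goes under a signature-plus-whitelist policy.

    signature_valid stands in for a real verification step (assumed to
    have happened elsewhere); closed_channel means "blacklist everything
    that is not signed and whitelisted".
    """
    if signature_valid and sender in whitelist:
        return "inbox"       # known, signed correspondent
    if closed_channel:
        return "reject"      # cousin Jimmy lands here, unseen
    return "filter"          # same filtering (and trawling) as before


if __name__ == "__main__":
    friends = {"alice@example.org"}  # hypothetical whitelist
    print(route("alice@example.org", True, friends, closed_channel=False))
    print(route("jimmy@example.org", False, friends, closed_channel=False))
    print(route("jimmy@example.org", False, friends, closed_channel=True))
```

Only the first branch is new.  Everything that is unsigned or unknown
either vanishes (closed channel) or falls through to exactly the filter
you already have.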

Then there is the cruel fact that you aren't going to convince all the
list managers in the world to digitally sign their list traffic.  You
can of course decide never to subscribe to a list that doesn't, but that
throws a whole lot of baby out with the bath water, and again leaves you
trawling rejects.

The only issue in controlling spam is whether one BOTHERS to trawl the
rejects.  This, in turn, is related to the level at which you reject
spam -- the ratio between false negatives (spam that makes it to your
regular spool) and false positives (mail that makes it into your spam
spool).

I set my spam filtering high enough that I don't check the rejects.
I've spot-checked the rejects for quite some time, and any "real mail"
that gets rejected is VERY likely to be something I will survive if I
never get, and if it IS important and for any reason the sender is a
friend they will almost certainly send me an out of band communication
in my mostly OPEN filter channel saying "Hey, why didn't you respond to
my note letting you know that you won the sweepstakes in Nigeria and are
about to be presented with a check by the grandson of its prime
minister?  I put a dozen URLs to the website announcing the result into
the message when I sent it from my laptop through the open WiFi net I
happened to be passing by..."

Others set the filter lower, but quick-scan the rejects.  Still others
might set one breakpoint VERY high (and reject 70% of the spam out of hand
with "no" false positives), set a second low enough that no spam at all
makes it into their regular spool, and quick scan the intermediate
rejects.  Just by examining the SA rating of spam that makes it through
and the rating of regular mail that makes it through, it is pretty easy
to see whether or not your active boundary is reasonable.
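
A minimal sketch of the two-breakpoint scheme, assuming each message
carries a numeric spam score where higher means more spam-like (as with
SA ratings).  The function names, the thresholds 4.0 and 15.0, and the
sample scores are all made up for illustration.  The report simply
counts hand-labelled messages per bin so you can see whether a boundary
is reasonable.

```python
from collections import Counter


def triage(score: float, low: float, high: float) -> str:
    """Sort a message into one of three bins by its spam score."""
    if low > high:
        raise ValueError("low breakpoint must not exceed high breakpoint")
    if score >= high:
        return "reject"   # discarded out of hand, never reviewed
    if score >= low:
        return "review"   # intermediate rejects, quick-scanned by a human
    return "inbox"        # regular spool


def boundary_report(samples, low: float, high: float) -> Counter:
    """Count (bin, label) pairs for hand-labelled (score, is_spam) samples."""
    report = Counter()
    for score, is_spam in samples:
        label = "spam" if is_spam else "ham"
        report[(triage(score, low, high), label)] += 1
    return report


if __name__ == "__main__":
    # Invented scores and labels, not real measurements.
    samples = [(0.3, False), (2.1, False), (4.8, True), (6.5, False),
               (9.0, True), (17.2, True), (21.0, True)]
    report = boundary_report(samples, low=4.0, high=15.0)
    print("spam reaching the inbox:", report[("inbox", "spam")])
    print("real mail awaiting review:", report[("review", "ham")])
    print("real mail rejected unseen:", report[("reject", "ham")])
```

If real mail starts showing up in the "reject" bin, the high breakpoint
is too low; if spam shows up in the inbox, the low breakpoint is too
high.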

Digital signatures won't change this a bit.  They may permit you to
identify a certain class of true negatives (and keep them out of the
false positive bin), although I at least am cynical about even that.
They won't keep you from having a wide range of mail that you still have
to filter and still have to review, if your criterion is "no false
positives" and you want to remain globally accessible to mail from
strangers, and the additional work required to implement them at all
greatly exceeds the work you're doing now, or could be doing with a bit
of rearrangement.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@xxxxxxxxxxxx