Re: Hashing spam

Vernon Schryver <vjs@xxxxxxxxxxxxxxxxxxxx> · Thu, 18 Dec 2003 13:01:18 -0700 (MST)

> From: John Stracke <jstracke@xxxxxxxxxxx>

> >I work on an approach to block spam with a database of hash (md5) string of
> >spam email:

> ...
> It's been done, and the spammers have already evolved to get around it: 
> they randomize the messages so that the hashes don't match.

Unless you mean naive and simplistic hashes, that is an overstatement.
As long as you want to accept mail from strangers, no spam filter can
perfectly predict whether copies of the next message from a stranger
are being sent to 30,000,000 of your intimate friends, but the various
hashing filters do some good work.
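
To make the distinction concrete, here is a minimal sketch of why a naive
hash is defeated by randomization while a slightly less naive one is not.
The normalization step (lowercase, keep only letters, collapse everything
else) and the sample messages are assumptions made up for illustration;
they are not taken from any real filter, which would use something far
more robust.

```python
import hashlib
import re


def naive_hash(body: str) -> str:
    """MD5 of the raw message body; any changed byte changes the hash."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def normalized_hash(body: str) -> str:
    """MD5 of the body after a crude, hypothetical normalization."""
    # Lowercase, then replace every run of non-letters (digits,
    # punctuation, whitespace) with a single space.
    text = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two copies of the same hypothetical spam with randomized filler.
spam_a = "Cheap loans!!! Call now. ref 83921"
spam_b = "Cheap  loans!!!  Call now.  ref 10477"

print(naive_hash(spam_a) == naive_hash(spam_b))            # False
print(normalized_hash(spam_a) == normalized_hash(spam_b))  # True
```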

An estimate of the effectiveness of a large scale filter can be obtained
from what it sees as the spam ratio.  If it claims that 60% of all
mail is spam but the real ratio is 70%, then it must be roughly 85%
effective (60/70).
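
The arithmetic behind that estimate can be sketched as follows.  The
function name is made up for this example, and the calculation assumes
that the filter flags essentially no legitimate mail, so that everything
it counts as spam really is spam.

```python
def estimated_effectiveness(claimed_ratio: float, real_ratio: float) -> float:
    """Fraction of real spam the filter catches.

    claimed_ratio: share of all mail the filter flags as spam.
    real_ratio:    share of all mail that actually is spam.
    Assumes false positives are negligible.
    """
    if not 0 < real_ratio <= 1:
        raise ValueError("real_ratio must be in (0, 1]")
    return claimed_ratio / real_ratio


# The figures from the text: 60% claimed, 70% real.
print(f"{estimated_effectiveness(0.60, 0.70):.1%}")  # 85.7%
```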

 ....

Concerning false positives for this mailing list--it would be wise to
define what mail is legitimate.  In many places, you must accept at
least 99.9% of all even remotely legitimate mail.  However, this context
is different.  Here a boolean "good/spam" is simplistic and wrong.
Instead we have a spectrum:
  1. on-topic messages from subscribers
  2. on-topic messages from non-subscribers
  3. noise from subscribers
  4. noise from non-subscribers
  5. pure spam such as advertisements for loan sharks

In this list, only #1 is clearly "good." It is good to avoid rejecting
#2, but there is surely no harm in sometimes delaying #2.  If the
senders of any rejected or "false positive" #2 received an informative
non-delivery report so that they could retransmit, what would be the harm?

SpamAssassin is reported to be better than 60% accurate.  #2 is surely
rare compared to #1.  Thus, as long as SpamAssassin white-lists all
subscribers, there would be no harm in the occasional rejection of #2.
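
As a rough sketch of that policy, assuming a list processor that knows
the subscriber addresses and receives a yes/no verdict from the filter:
the function name, the return strings, and the example addresses are all
hypothetical, and this is not how any particular list software or
SpamAssassin itself is configured.

```python
def disposition(sender: str, subscribers: set, flagged_as_spam: bool) -> str:
    """Decide what to do with a message sent to the list."""
    # Subscribers are white-listed: their mail is never filtered.
    if sender.strip().lower() in subscribers:
        return "deliver"
    # A flagged non-subscriber gets an informative non-delivery
    # report so that a legitimate sender can retransmit.
    if flagged_as_spam:
        return "reject-with-ndr"
    return "deliver"


subscribers = {"alice@example.org", "bob@example.org"}

print(disposition("Alice@example.org", subscribers, True))     # deliver
print(disposition("stranger@example.net", subscribers, True))  # reject-with-ndr
print(disposition("stranger@example.net", subscribers, False)) # deliver
```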

Vernon Schryver    vjs@xxxxxxxxxxxx