> From: John Stracke <jstracke@xxxxxxxxxxx>

> >I work on an approach to block spam with a database of hash (md5) string of
> >spam email:
> ...
> It's been done, and the spammers have already evolved to get around it:
> they randomize the messages so that the hashes don't match.

Unless you mean naive and simplistic hashes, that is an overstatement.
As long as you want to accept mail from strangers, no spam filter can
perfectly predict whether copies of the next message from a stranger
are being sent to 30,000,000 of your intimate friends, but the various
hashing filters do some good work.

The effectiveness of a large scale filter can be estimated from what it
sees as the spam ratio. If it claims that 60% of all mail is spam but
the real ratio is 70%, then it must be catching about 60/70 of the spam,
which makes it roughly 85% effective.

....

Concerning false positives for this mailing list, it would be wise to
define what mail is legitimate. In many places, you must accept at
least 99.9% of all even remotely legitimate mail. However, this context
is different. Here a boolean "good/spam" is simplistic and wrong.
Instead we have a spectrum:

  1. on-topic messages from subscribers
  2. on-topic messages from non-subscribers
  3. noise from subscribers
  4. noise from non-subscribers
  5. pure spam such as advertisements for loan sharks

In this list, only #1 is clearly "good." It is good to avoid rejecting
#2, but there is surely no harm in sometimes delaying #2. If the senders
of any rejected or "false positive" #2 received an informative
non-delivery report so that they could retransmit, what would be the
harm?

SpamAssassin is reported to be better than 60% accurate. #2 is surely
rare compared to #1. Thus, as long as SpamAssassin white-lists all
subscribers, there would be no harm in the occasional rejection of #2.


Vernon Schryver    vjs@xxxxxxxxxxxx
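
P.S. The effectiveness estimate in the message above can be sketched in
a few lines of Python. This is only an illustration: it assumes the
filter has essentially no false positives, so that everything it flags
really is spam, and the function and parameter names are made up for
this example rather than taken from any filter discussed here.

```python
def estimate_effectiveness(flagged_ratio: float, real_spam_ratio: float) -> float:
    """Estimate the fraction of spam a filter catches.

    flagged_ratio   -- fraction of all mail the filter reports as spam
    real_spam_ratio -- fraction of all mail that really is spam

    Assumption: the filter has no false positives, so the mail it
    flags is a subset of the real spam.
    """
    if not 0 < real_spam_ratio <= 1:
        raise ValueError("real_spam_ratio must be in (0, 1]")
    if not 0 <= flagged_ratio <= real_spam_ratio:
        raise ValueError("flagged_ratio must be in [0, real_spam_ratio]")
    return flagged_ratio / real_spam_ratio


if __name__ == "__main__":
    # The figures from the message: 60% flagged, 70% really spam.
    print(f"{estimate_effectiveness(0.60, 0.70):.1%}")  # prints 85.7%
```

If the filter does have false positives, some of the flagged mail is
not spam and this figure overstates the real catch rate.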