Re: Why Spam is a problem

John Stracke <jstracke@centivinc.com> · Mon, 19 Aug 2002 11:54:39 -0400

Vernon Schryver wrote:

>From: John Stracke <jstracke@centivinc.com>
>
>>That would be somewhat less useful in this case, though, since each user 
>>has their own table of keywords.
>>
>That contradicts other assumptions about this mechanism
>
Whose? The author of the original article was very explicit that he was 
advocating that users have individual tables.

>and it points
>out a major problem.  One assumption is that spam is a more rather
>than less uniformly distributed flood.  If it is not uniform, how can
>you hope that the statistical characteristics of previous samples will
>be related to future samples from new spammers?  If spam is uniform,
>then why do users need private tables of keywords?
>
I think it was to reduce false positives--because the profile of 
legitimate mail differs from user to user, so a word that marks spam 
for one person can be routine for another.

>The major problem is that the mechanism requires a significant and
>continuing false-negative rate to keep the scoring tuned as spammers
>come and go.
>
I dunno; keeping the tuning up to date sounds like a strength to me.  It 
requires some level of effort, but a much lower level than deleting 
every piece of spam by hand.

>Of course, the main problem with any and every such system is that it
>is looking for characteristics other than "unsolicited" and "bulk."
>
Yes, and the main problem with the DCC is that it does not look at 
anything else.

When I moved last fall, I went through old mail, harvested the addresses 
of old friends, and sent out mail with my new address.  Some of these 
people had never received email from me (they and I were CC:ed on the 
same messages from other friends), so I would not have been on their 
whitelist.  I don't know how many people I sent to, but it was certainly 
more than 10--which you say counts as bulk.  So, if at least 10 of those 
people had been using the DCC, then my message would have been tagged as 
UBE, and some of them would not have gotten it.  I suppose one might 
argue this message was bulk email, but I knew every one of those people 
personally, considered them friends (even if I hadn't seen them since 
college), and had reason to believe that they would be at least somewhat 
pleased to keep track of me.  Why should that be filtered?
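
To make the scenario concrete, here is a toy model of count-based bulk 
tagging. It is a sketch of the situation described above, not the DCC's 
actual protocol or API: the class ToyBulkCounter, the function 
is_tagged_bulk, the SHA-256 checksum, and the threshold of 10 (taken from 
the "more than 10" figure in this thread) are all assumptions made for 
illustration.

```python
import hashlib
from collections import Counter

# Assumed cutoff, borrowed from the "more than 10" figure in this thread.
BULK_THRESHOLD = 10


class ToyBulkCounter:
    """Hypothetical shared tally: one count per message-body checksum."""

    def __init__(self):
        self.counts = Counter()

    def report(self, body: str) -> int:
        """Record one more recipient of this body; return the new total."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.counts[digest] += 1
        return self.counts[digest]


def is_tagged_bulk(counter, body, sender, whitelist, threshold=BULK_THRESHOLD):
    """Tag the message once enough copies were reported, unless the
    recipient has whitelisted the sender."""
    seen = counter.report(body)
    return seen >= threshold and sender not in whitelist


if __name__ == "__main__":
    counter = ToyBulkCounter()
    body = "I have moved; here is my new address."
    # Twelve recipients, none of whom has the sender on a whitelist.
    tags = [is_tagged_bulk(counter, body, "john@example.com", set())
            for _ in range(12)]
    print(tags)
```

In this model the first nine recipients see the message untagged and 
everyone from the tenth on sees it tagged, whatever the content and 
whatever the relationship between sender and recipient. Only a recipient 
who had already whitelisted the sender would be spared.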

I'm not advocating the Bayesian approach as a silver bullet, mind you; 
but I think it's an interesting area to look into.  Even if it doesn't 
work, the general idea of filtering based on personalized statistics 
could lead to something that works better.
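
As a minimal sketch of what "personalized statistics" could mean, the 
following keeps a separate token table per user and scores a message 
against that user's own history. It uses a plain naive Bayes score with 
add-one smoothing, which is swapped in here for illustration and is not 
claimed to be the original article's formula; the class name UserTable 
and the example messages are hypothetical.

```python
import math
from collections import Counter


class UserTable:
    """One user's private token statistics (hypothetical structure)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of legitimate messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text: str, is_spam: bool) -> None:
        """Count each distinct token once per message."""
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text: str) -> float:
        """Naive Bayes estimate with add-one smoothing (an assumption)."""
        log_odds = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in set(text.lower().split()):
            p_spam = (self.spam[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    loan_officer, other_user = UserTable(), UserTable()
    for table in (loan_officer, other_user):
        table.train("cheap mortgage offer", is_spam=True)
    loan_officer.train("mortgage rates for client", is_spam=False)
    other_user.train("lunch on friday", is_spam=False)

    message = "mortgage rates update"
    print(loan_officer.spam_probability(message))
    print(other_user.spam_probability(message))
```

The same message scores lower against the table of the user whose 
legitimate mail routinely mentions mortgages, which is the 
false-positive argument for private tables made earlier in this message.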

-- 
/===============================================================\
|John Stracke      |jstracke@centivinc.com                      |
|Principal Engineer|http://www.centivinc.com                    |
|Centiv            |My opinions are my own.                     |
|===============================================================|
|Both candidates are better than a megalomaniac mutant lab mouse|
|bent on world domination...but it's pretty close.              |
\===============================================================/