Re: Why Spam is a problem

Vernon Schryver <vjs@calcite.rhyolite.com> · Mon, 19 Aug 2002 09:08:44 -0600 (MDT)

> From: John Stracke <jstracke@centivinc.com>

> Perry E. Metzger wrote:

> ...
> >The problem is that, naturally, the spammers will start running the
> >tool over their spam before sending it and tweaking it until it
> >passes. This has happened for previous techniques.
> >
> That would be somewhat less useful in this case, though, since each user
> has their own table of keywords.

That contradicts other assumptions about this mechanism and it points
out a major problem.  One assumption is that spam is a more rather
than less uniformly distributed flood.  If it is not uniform, how can
you hope that the statistical characteristics of previous samples will
be related to future samples from new spammers?  If spam is uniform,
then why do users need private tables of keywords?  Spam distribution
does have non-uniformities for language and character set, but as the
Russian, Spanish, and Asian spam that English speakers receive shows,
spammers try very hard to make spam distribution uniform.

(As I've said, based on DCC counts and other data, I figure there
are fewer than 2000 whack-a-mole spammers on the net at any
time, and there is an average 100% turnover every 6 months, with even
the worst whack-a-mole spammers giving up after a few years.
"Whack-a-mole" is the recognized technical term for the spammers that
flog Viagra, porn, and loansharking as opposed to the big outfits
like American Express and Dell Computers that never give up their
"push advertising" campaigns.)

The major problem is that the mechanism requires a significant and
continuing false-negative rate to keep the scoring tuned as spammers
come and go.  For example, notice that the proposal talks about
filtering based on "words" that are in fact domain names.  A mail
message today containing "cyberpromo" is almost certainly not spam,
but almost certainly was spam not very long ago.  What if the long
established and very well known (except evidently to Mr. Graham)
spam haus 263.com also reforms itself?
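
To make the staleness problem concrete, here is a minimal sketch in
Python. It assumes the simplest possible per-token estimate (the
fraction of labeled messages containing the token that were spam),
which is a simplification and not necessarily the formula in the
proposal. The function names and the counts are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits):
    """Fraction of labeled messages containing the token that were spam.

    Assumption: a plain ratio of counts, with 0.5 for an unseen token.
    """
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.5


def corrections_needed(spam_hits, ham_hits, neutral=0.5):
    """Count the legitimate messages a user must hand-label before the
    token's estimate is no longer above the neutral point."""
    corrections = 0
    while token_spam_probability(spam_hits, ham_hits + corrections) > neutral:
        corrections += 1
    return corrections


# Hypothetical history: a domain name seen in 200 spam messages and
# no legitimate ones, which then changes hands or reforms.
needed = corrections_needed(spam_hits=200, ham_hits=0)
print(needed)
```

Under these assumptions every one of those corrections is a message
the filter misjudged and a person had to fix by hand, which is the
continuing error rate described above.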

You might answer that problem by having a centralized service run a lot
of spam traps and distribute keyword scores.  Brightmail has long
been operating a system that does that, but I understand they distribute
regular expressions and checksums instead of keyword scores.  I'm not
sure that difference is significant.  Another issue is that spam traps
get a distinct, atypical sample of spam. 

I used the word "scoring" rather than the "probabilities" others in this
thread use because I cannot see a major operational difference between
the "probabilities" of this proposal and the "scoring" of systems such
as SpamAssassin.  Recall that SpamAssassin also computes a score based
on the presence or absence of words, as well as other features.  Using
a computer program to tune your SpamAssassin or similar keyword scoring
is not a bad idea, but it's not going to end spam as we know it.
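
The operational similarity can be sketched in a few lines of Python.
This assumes the per-word probabilities are combined with the usual
naive Bayes rule (an assumption; the thread does not spell out the
formula), and the token probabilities and threshold are made up for
illustration. Under that rule, comparing the combined probability to
a threshold gives the same verdict as summing one additive score per
word (its log-odds) and comparing the sum to a cutoff.

```python
import math


def combined_probability(token_probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


def additive_score(token_probs):
    """Sum of per-token log-odds: one additive score per keyword."""
    return sum(math.log(p / (1.0 - p)) for p in token_probs)


# Hypothetical per-token probabilities for one message.
probs = [0.99, 0.95, 0.20, 0.60]
threshold = 0.9  # hypothetical cutoff on the combined probability

p = combined_probability(probs)
s = additive_score(probs)

# The same cutoff, expressed on the additive scale.
score_cutoff = math.log(threshold / (1.0 - threshold))

is_spam_by_probability = p > threshold
is_spam_by_score = s > score_cutoff
print(is_spam_by_probability, is_spam_by_score)
```

The two tests agree for any list of probabilities strictly between 0
and 1, because the combined probability is just the logistic function
of the summed log-odds.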

Of course, the main problem with any and every such system is that it
is looking for characteristics other than "unsolicited" and "bulk."

Vernon Schryver    vjs@rhyolite.com