Re: spambayes

James Wilkinson <fedora@xxxxxxxxxxxxxxxxxxx> · Wed, 10 May 2006 13:12:47 +0100

Aaron Konstam wrote:
> But, and no one asked, it is based on a mistaken assumption that it is
> useful to have mail identified, in addition to spam and ham, as unknown. I
> don't think they call it unknown, but that is the purpose. I can't go
> into the whole argument, but to me this tri-classification is not only
> unnecessary but more trouble to deal with.

I, on the other hand, find it excellent. The program has the honesty to
ask for help when it gets stuck.

What we'd all *like*, ideally, is an antispam program that could
identify what we considered to be spam with 100% accuracy.

That turns out to be practically impossible. There will be e-mails that
are borderline, e-mails that "look" like spam but are actually wanted
(false positives), e-mails that "look" wanted but are really spam (false
negatives), and ones that no program can reliably classify on its own.

The "unsure" category provides a place for the borderline and the hard
cases, and massively reduces false positives and negatives (they usually
end up in "unsure", instead of "good" or "spam").
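
To make that concrete, here is a minimal sketch of the three-way split,
assuming the filter gives each message a spam score between 0 and 1 and
that two cutoffs divide that range. The function name and the cutoff
values (0.20 and 0.90) are my own illustration, not taken from the
SpamBayes source.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to "good", "unsure" or "spam".

    The cutoff defaults are assumptions for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "good"
    if score >= spam_cutoff:
        return "spam"
    # Everything between the two cutoffs is a borderline case.
    return "unsure"


if __name__ == "__main__":
    for s in (0.05, 0.50, 0.95):
        print(s, classify(s))
```

Moving the cutoffs further apart makes "good" and "spam" more
trustworthy at the cost of a larger "unsure" folder.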

So you get "good" folders that you can be pretty certain are good. You
get "spam" folders that *very* *very* rarely have good e-mail in them.
And you have a folder *marked* "dodgy". So you can quickly deal with it
when you want, with the expectation that it's probably spam.

Of course, since the program is based on a modified Bayesian algorithm,
you are expected to train on errors. You are expected to put a little
bit of time into helping the program. "Unsure" is simply where e-mails
go if the program needs to be trained on them.
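
Roughly, "train on errors" could look like the sketch below. It uses a
toy naive Bayes scorer that I made up for the example (the ToyFilter
class, the train_on_errors function and the cutoffs are all
hypothetical); it is not the modified algorithm SpamBayes actually
uses, and it only shows the shape of the feedback loop.

```python
import math
from collections import Counter


class ToyFilter:
    """Tiny naive Bayes scorer. Hypothetical; not the SpamBayes algorithm."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.totals = {"spam": 0, "good": 0}  # messages trained per class

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def score(self, text):
        """Return an estimated spam probability in [0, 1] (equal priors)."""
        log_odds = 0.0
        for token in set(text.lower().split()):
            # Add-one smoothing avoids division by zero and log(0).
            p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + 2)
            p_good = (self.counts["good"][token] + 1) / (self.totals["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_on_errors(filt, text, true_label, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only if the filter was wrong or unsure. Returns True if it trained."""
    s = filt.score(text)
    if s < ham_cutoff:
        verdict = "good"
    elif s >= spam_cutoff:
        verdict = "spam"
    else:
        verdict = "unsure"
    if verdict != true_label:
        filt.train(text, true_label)
        return True
    return False
```

An untrained filter scores everything at 0.5, so every message starts
out "unsure" and gets trained on; once the filter classifies a message
confidently and correctly, it is left alone.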

James.

-- 
E-mail address: james | "Today Has Been Two Of Those Days."
@westexe.demon.co.uk  |     -- Mike Andrews
