akonstam at sbcglobal.net
Wed May 10 15:07:07 UTC 2006
On Wed, 2006-05-10 at 13:12 +0100, James Wilkinson wrote:
> Aaron Konstam wrote:
> > But, and no one asked, it is based on a mistaken assumption that it is
> > useful to have mail identified, in addition to spam and ham, as unknown. I
> > don't think they call it unknown but that is the purpose. I can't go
> > into the whole argument but to me this tri-classification is not only
> > unnecessary but more trouble to deal with.
> I, on the other hand, find it excellent. The program has the honesty to
> ask for help when it gets stuck.
> What we'd all *like*, ideally, is an antispam program that could
> identify what we considered to be spam with 100% accuracy.
> That turns out to be practically impossible. There will be e-mails that
> are border-line, e-mails that "look" like spam but are actually wanted
> (false positives), e-mails that "look" wanted but are really spam (false
> negatives), and ones that are pretty impossible to automatically classify.
> The "unsure" category provides a place for the border-line and the Hard
> Cases, and massively reduces false positives and negatives (they usually
> end up in "unsure", instead of "good" or "spam").
> So you get "good" folders that you can be pretty certain are good. You
> get "spam" folders that *very* *very* rarely have good e-mail in them.
> And you have a folder *marked* "dodgy". So you can quickly deal with it
> when you want, with the expectation that it's probably spam.
> Of course, since the program is based on a modified Bayesian algorithm,
> you are expected to train on errors. You are expected to put a little
> bit of time into helping the program. "Unsure" is simply where e-mails
> go if the program needs to be trained on them.
All this is too technical a matter to deal with here. Training on the
unsure is no different from training on the ham which should be spam and
the spam that should be ham. In any case you can't be overconfident
with SpamBayes. You still have to check for spam that is misclassified
and ham which is misclassified. So now you have three streams to check
rather than two. That to me is an extra pain.
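
To make the two-streams versus three-streams point concrete, here is a minimal sketch of score-based routing with two cutoffs. It is not SpamBayes code: the function names, the cutoff values (0.20 and 0.90) and the exact boundary handling are assumptions chosen for illustration.

```python
# Hypothetical sketch of two-cutoff routing; not taken from SpamBayes itself.
# The cutoff values and the comparison operators are illustrative assumptions.

def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def route(scores, ham_cutoff=0.20, spam_cutoff=0.90):
    """Group scores into the folders a user would then have to review."""
    folders = {"ham": [], "unsure": [], "spam": []}
    for score in scores:
        folders[classify(score, ham_cutoff, spam_cutoff)].append(score)
    return folders


if __name__ == "__main__":
    scores = [0.01, 0.15, 0.45, 0.70, 0.93, 0.99]
    # Two distinct cutoffs: three streams to check.
    print(route(scores))
    # Equal cutoffs: the 'unsure' stream stays empty, leaving two streams.
    print(route(scores, ham_cutoff=0.50, spam_cutoff=0.50))
```

Setting both cutoffs to the same value collapses the scheme to ham and spam only, at the cost of pushing the border-line messages into one of those two folders.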
Aaron Konstam <akonstam at sbcglobal.net>