John Stracke <jstracke@centivinc.com> writes:

> The approach described looks only at the 15 words furthest from 0.5;
> it seems likely that most messages that would rank at 0.9 or above
> would have enough spam-words that words at 0.2 wouldn't show up.

I missed that point.  Random words indeed wouldn't work then.  Guessing
at the `right' words still might.

If the deployment results in spammers sending multiple copies of their
spew with different sets of decoy words, the problem would actually get
worse.  One can imagine sets of decoy words for given demographics;
e.g., for networking nerds: TCP, MPLS, duplex, route, BGP...  One can
imagine 1000 sets of decoy words for different categories of people,
with each message sent by the spammer in 1000 copies (so you might get
it in 0 copies if you're very unusual -- or in 50 copies if you
regularly discuss fishing, computer networking, investing, travel, and
other such categories in email, and the spammer has decoy lists for
fishing, etc.).

One can imagine software that then looks for word combinations in
messages rather than individual words, making the state much larger and
the spammer's job harder yet.  The spammers would probably retort by
using random subsets of their decoy word sets.

> One thing that would be necessary, and that the author doesn't
> mention, would be to decode content-encodings before applying the
> filter; otherwise spammers could just base64 all their messages.

Even scarier: spammers like to use Javascript even now to `encrypt'
interesting parts of messages, such as URLs (this way, they make it
harder to determine where a given web page is hosted).  If Javascript
works in the recipient's MUA, then you have a Turing-complete way of
hiding, and Rice's theorem works against you.

-- 
Stanislav Shalunov              http://www.internet2.edu/~shalunov/

Sex is the mathematics urge sublimated.  -- M. C. Reed.
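[A minimal sketch, not the filter's actual code, of the "15 words
furthest from 0.5" scoring discussed above.  It assumes per-word spam
probabilities have already been estimated from training mail; the
function name and the naive-Bayes-style combining rule are my own
illustration.  It shows why a few decoy words at 0.2 never get looked
at when fifteen 0.99 words are present.]

```python
def spam_score(word_probs, n=15):
    """Combine the n most 'interesting' per-word spam probabilities.

    word_probs: dict mapping token -> P(spam | token), each in (0, 1).
    Tokens near 0.5 carry little evidence, so only the n tokens
    furthest from neutral are considered at all.
    """
    # Pick the n probabilities furthest from the neutral 0.5.
    interesting = sorted(word_probs.values(),
                         key=lambda p: abs(p - 0.5),
                         reverse=True)[:n]
    # Naive-Bayes-style combination of the selected probabilities.
    prod_p = 1.0
    prod_not_p = 1.0
    for p in interesting:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)
```

With fifteen tokens at 0.99 in the message, five decoys at 0.2 fall
outside the top 15 (distance 0.3 versus 0.49 from 0.5) and have no
effect on the score.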
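[On the content-encoding point: a sketch of what "decode before
filtering" means in practice, using Python's standard email parser.
The helper name is made up; get_payload(decode=True) is what undoes a
base64 or quoted-printable Content-Transfer-Encoding so the filter
sees words, not an opaque blob.]

```python
from email import message_from_string


def decoded_text(raw_message):
    """Yield the decoded text parts of a raw RFC 2822 message.

    get_payload(decode=True) reverses the Content-Transfer-Encoding
    (base64, quoted-printable), so a spammer can't hide tokens from
    the filter simply by base64-encoding the body.
    """
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            if payload is not None:
                # Fall back to latin-1 if no charset is declared.
                charset = part.get_content_charset() or "latin-1"
                yield payload.decode(charset, errors="replace")
```

A tokenizer would then run over the yielded text rather than over the
raw transfer-encoded body.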