2009/5/29 Merlin Morgenstern <merlin_x@xxxxxxxxxxx>:
>
> Per Jessen wrote:
>>
>> Merlin Morgenstern wrote:
>>
>>> Hi there,
>>>
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.g. "Bewerbung" -> "Werbung"
>>>
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like "zu Hause" or "flexible/Arbeit"
>>
>> First thought - use SpamAssassin.
>> Second thought - use regexes.
>>
>> /Per
>>
>
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
>
> Any ideas?

I've had to solve this problem before, and the conclusion I came to is
that with this kind of simple matching you have to accept either false
positives or false negatives.

Alternatives include implementing Bayesian filtering or some other
algorithm that's more complex than simple matching, or using a
pre-existing solution. I'm sure you could integrate SpamAssassin or
similar, because at the end of the day all those systems expect is a
bunch of text. If they require the headers of an email, you can supply
fake ones and remove any effect headers have on the score.

Whether that's worth it depends on the volume you're talking about and
how many manual moderation checks you want to have to do.

-Stuart

--
http://stut.net/
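
For reference, Per's "use regexes" suggestion could be sketched roughly as below. This is only an illustration, not something posted in the thread: the function name find_spam_keywords and the sample texts are made up, and it assumes PHP 7 or later with UTF-8 input. The idea is to accept a keyword only when no letter or digit touches it on either side, so "Werbung" inside "Bewerbung" is skipped while "zu Hause" and "flexible/Arbeit" still match as phrases.

```php
<?php
// Hypothetical helper (not from the thread): returns the keywords that occur
// in $text as whole words or phrases, not as parts of longer words.
function find_spam_keywords(string $text, array $keywords): array
{
    $hits = [];
    foreach ($keywords as $keyword) {
        // preg_quote escapes regex metacharacters and the "/" delimiter,
        // so a phrase such as "flexible/Arbeit" is matched literally.
        $quoted = preg_quote($keyword, '/');
        // The lookbehind and lookahead require that no letter or digit
        // touches the keyword. "i" = case-insensitive, "u" = UTF-8 mode.
        $pattern = '/(?<![\p{L}\p{N}])' . $quoted . '(?![\p{L}\p{N}])/iu';
        // preg_match returns 1 on a match, 0 on no match, false on error.
        if (preg_match($pattern, $text) === 1) {
            $hits[] = $keyword;
        }
    }
    return $hits;
}

$keywords = ['Werbung', 'zu Hause', 'flexible/Arbeit'];

// "Werbung" only occurs inside "Bewerbung" here, so nothing is reported.
$clean = find_spam_keywords('Meine Bewerbung liegt bei.', $keywords);

// All three keywords occur as standalone words or phrases here.
$spam = find_spam_keywords('Werbung: von zu Hause, flexible/Arbeit!', $keywords);

echo count($clean), ' / ', count($spam), "\n";
```

This only removes the "keyword inside a word" false positives. As Stuart says, simple matching still trades false positives against false negatives: a poster who writes "W e r b u n g" will slip through.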