We have a support ticket system we built, and customers can reply via email, which then posts their reply into our database. The problem is that when you read a ticket, you see each ticket entry (a row in the DB), but the entries tend to accumulate the text of the previous entries because the customer replied to an email. A thread, if you will.

I'm trying to strip out the "duplicate parts". This is cosmetic, on the front end, via a checkbox, in case the support person needs to see the actual unaltered version, such as when the algorithm is too aggressive and inadvertently rips out important pieces.

One challenge I'm running into is situations like this, where the text is embedded but has been slightly altered.

ENTRY 1:

For security and confidentiality reasons, we request that all subscribers who are requesting cancellation do so via the website of the company billing their account. You can easily cancel your membership on our billing agent website

(just in case THIS PHP list software mangles the above, it is just one long string with no CR breaks as the ones below have)

ENTRY 2: (which was most likely mangled by the customer's email client and formatted for 72 chars)

For security and confidentiality reasons, we request that all
subscribers who are requesting cancellation do so via the website of
the company billing their account. You can easily cancel your
membership on our billing agent website

This is a simple example, but the solution logic might extend to other things, such as a prefix like so:

ENTRY 3: (again mangled by the email client, this time prefixed with ">" marks)

> For security and confidentiality reasons, we request that all
> subscribers who are requesting cancellation do so via the website of
> the company billing their account. You can easily cancel your
> membership on our billing agent website

Keep in mind those blobs of text are often embedded inside other text which I *do* want to display.
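To make the three variants concrete, here is a minimal sketch showing that they only differ in whitespace and quote markers. The helper name normalize_quoted() is hypothetical (it is not part of our ticket system), and the entries are shortened to the first sentence. It only proves the copies compare equal once normalized; it does not yet solve removing them from the original text.

```php
<?php
// Hypothetical helper (an assumption for illustration): reduce a message to a
// canonical form so that re-wrapped and quoted copies compare equal.
function normalize_quoted(string $text): string
{
    // Drop leading ">" quote markers (possibly nested, e.g. "> > ") on each line.
    $text = preg_replace('/^[ \t]*(?:>[ \t]*)+/m', '', $text);
    // Collapse every run of whitespace, including line breaks, to one space.
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

$entry1 = "For security and confidentiality reasons, we request that all subscribers who are requesting cancellation do so via the website of the company billing their account.";
$entry2 = "For security and confidentiality reasons, we request that all\n"
        . "subscribers who are requesting cancellation do so via the website of\n"
        . "the company billing their account.";
$entry3 = "> For security and confidentiality reasons, we request that all\n"
        . "> subscribers who are requesting cancellation do so via the website of\n"
        . "> the company billing their account.";

// Both comparisons are true: the three entries normalize to the same string.
var_dump(normalize_quoted($entry1) === normalize_quoted($entry2));
var_dump(normalize_quoted($entry1) === normalize_quoted($entry3));
```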
Initially I was thinking that I could use a simple regex on the needle and the haystack to strip out all whitespace and str_ireplace() them that way, but then I don't have a way to put the whitespace back that I can see.

Currently I'm just brute forcing it: I compare the current message to the previous ones, and if a previous message is found in this message, I blank it out. But of course this only works if they are identical.

    <?php
    $i = 0;
    // The initial ticket message is in a different table than the replies that follow.
    $entry_message[$i] = $my_ticket->get_message(false);
    foreach ($my_ticket->get_entries() as $eid => $entry) {
        $i++;
        $output_message = $entry_message[$i] = trim($entry['message']);
        //var_dump('OUTPUT MESSAGE:', $output_message);
        // Walk backwards from the most recent earlier message to the first one.
        for ($j = ($i - 1); $j >= 0; --$j) {
            //echo "\n<br><font color='green'><b>searching for entry_message[$j] in [i = $i]:</b><br>\n$output_message</font><br>\n";
            $output_message = str_replace($entry_message[$j], '', $output_message);
            //var_dump('NEW OUTPUT MESSAGE:', $output_message);
        }
    }

( ^ you have to start from the bottom up like that or else you have altered your $output_message so subsequent matches fail ^ )

Would these be helpful?

http://us2.php.net/manual/en/function.similar-text.php
http://us2.php.net/manual/en/function.levenshtein.php
http://us2.php.net/manual/en/function.soundex.php
http://us2.php.net/manual/en/function.metaphone.php

It seems like similar_text() could be: if it returns a high percentage, consider it a match. But then how do I extract that part from the source string, since str_replace() requires an exact match, not a fuzzy one?

I am also thinking of something with preg_replace(), where I break up the source string, take the first word(s) and last word(s), and use .*? in between. But that has its own challenges. For example...

/For .*? website/

On this text it doesn't make the match I really want (it stops on the second line)...

For security and confidentiality reasons, we request that all
subscribers who are requesting cancellation do so via the website of
the company billing their account. You can easily cancel your
membership on our billing agent website
More stuff goes here website

By putting more words before and after the .*? I could get better accuracy, but that is starting to feel hacky or fragile somehow.

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
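
Following up on the preg_replace() idea: instead of anchoring on only the first and last words, one direction might be to build the pattern from every word of the earlier message, and let the gaps between words match any whitespace plus optional ">" markers. The untouched text keeps its own whitespace, since nothing is stripped from the haystack before matching. This is only a sketch under assumptions: build_fuzzy_pattern() and strip_quoted_copy() are hypothetical names, it handles re-wrapping and ">" prefixes but not edited wording, and very long messages may produce a pattern too large for PCRE to compile (in which case the sketch falls back to the unaltered text).

```php
<?php
// Hypothetical helper: turn an earlier message into a regex that tolerates
// re-wrapping and ">" quote prefixes between (and before) its words.
function build_fuzzy_pattern(string $needle): ?string
{
    $words = preg_split('/\s+/', trim($needle), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return null; // nothing to search for
    }
    // Escape each word so punctuation such as "." is matched literally.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    // Between words: whitespace, optionally followed by ">" quote markers.
    $gap = '\s+(?:>\s*)*';
    // Also swallow quote markers directly in front of the first word.
    return '/(?:>[ \t]*)*' . implode($gap, $quoted) . '/';
}

// Hypothetical helper: remove a (possibly re-wrapped or quoted) copy of
// $needle from $haystack, leaving the rest of $haystack untouched.
function strip_quoted_copy(string $needle, string $haystack): string
{
    $pattern = build_fuzzy_pattern($needle);
    if ($pattern === null) {
        return $haystack;
    }
    $result = preg_replace($pattern, '', $haystack);
    // preg_replace() returns null on error; show the unaltered text then.
    return $result === null ? $haystack : $result;
}

$original = "For security and confidentiality reasons, we request that all subscribers who are requesting cancellation do so via the website of the company billing their account.";
$reply = "Thanks, but I could not find the link.\n\n"
       . "> For security and confidentiality reasons, we request that all\n"
       . "> subscribers who are requesting cancellation do so via the website of\n"
       . "> the company billing their account.\n\n"
       . "Can you send it again?";

// Prints the reply with the quoted block removed and the rest intact.
echo strip_quoted_copy($original, $reply);
```

In the loop from my current code, this would take the place of the str_replace() call, still working from the most recent earlier message backwards.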