How do I remove a string from another string in a fuzzy way?

"Daevid Vincent" <daevid@xxxxxxxxxx> · Mon, 20 May 2013 19:17:53 -0700

We have a support ticket system we built and customers can reply via email
which then posts their reply into our database. The problem is that when you
read a ticket, you see each ticket entry (row in DB), but the entries tend to
accumulate the text of previous entries, since the customer replied to an
email. A thread, if you will.

I'm trying to strip out the "duplicate parts" (cosmetically on the front end
via a checkbox, in case the support person needs to see the actual unaltered
version such as in cases where the algorithm may be too aggressive and rip
out important pieces inadvertently).

One challenge I'm running into is situations like this, where the text is
embedded but has been slightly altered.

ENTRY 1:

For security and confidentiality reasons, we request that all subscribers
who are requesting cancellation do so via the website of the company billing
their account. You can easily cancel your membership on our billing agent
website

(just in case THIS PHP list software mangles the above, it is just one long
string with no CR breaks as the ones below have)

ENTRY 2: (which was mangled by the customer's email client most likely and
formatted for 72 chars)

For security and confidentiality reasons, we request that all
subscribers who are requesting cancellation do so via the website of
the company billing their account. You can easily cancel your 
membership on our billing agent website

This is a simple example, but the solution logic might extend to other
things such as perhaps a prefix like so:

ENTRY 3: (again mangled by email client to prefix with ">" marks)

> For security and confidentiality reasons, we request that all
> subscribers who are requesting cancellation do so via the website of
> the company billing their account. You can easily cancel your 
> membership on our billing agent website

Keep in mind those blobs of text are often embedded inside other text which
I *do* want to display.

Initially I was thinking that somehow I could use a simple regex on the
needle and haystacks to strip out all white space and str_ireplace() them
that way, but then I don't have a way to put the whitespace back that I can
see.
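
One way around the "put the whitespace back" problem is to leave the haystack alone and loosen the needle instead: turn the earlier message into a regex in which the gap between any two words may be any mix of whitespace and ">" markers. The following is only a minimal sketch under that assumption; fuzzy_needle_pattern() and strip_fuzzy() are hypothetical names made up for illustration, not existing PHP functions.

```php
<?php
// Hypothetical helper: turn $needle into a regex that ignores how the text
// was wrapped and whether lines were prefixed with ">" quote markers.
function fuzzy_needle_pattern($needle)
{
    $words = preg_split('/\s+/', trim($needle), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return null; // nothing to search for
    }
    // Quote each word so characters like "." and "?" match literally.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    // Optional quote prefix before the first word, then the words separated
    // by any mix of whitespace and ">" characters. "m" makes ^ match at each
    // line start, "i" makes the comparison case-insensitive.
    return '/(?:^[ \t]*(?:>[ \t]*)+)?' . implode('[\s>]+', $quoted) . '/mi';
}

// Hypothetical helper: remove fuzzy copies of $needle from $haystack,
// leaving every other character of $haystack untouched.
function strip_fuzzy($needle, $haystack)
{
    $pattern = fuzzy_needle_pattern($needle);
    if ($pattern === null) {
        return $haystack;
    }
    $result = preg_replace($pattern, '', $haystack);
    // preg_replace() returns null on failure (e.g. backtrack limit hit).
    return $result === null ? $haystack : $result;
}
```

Called as strip_fuzzy($entry_message[$j], $output_message), this would cover re-wrapped and ">"-prefixed copies like ENTRY 2 and ENTRY 3, but not copies whose wording was edited. A very long needle also produces a very long pattern, which is why the sketch falls back to the unaltered text when preg_replace() fails.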

Currently I'm just sort of brute forcing it and comparing the current
message to previous ones and if the previous message is found in this
message, then blank it out. But this only works of course if they are
identical.

<?php
$i = 0;

// The initial ticket message is in a different table than the replies hereafter.
$entry_message[$i] = $my_ticket->get_message(false);

foreach ($my_ticket->get_entries() as $eid => $entry)
{
	$i++;
	$output_message = $entry_message[$i] = trim($entry['message']);
	//var_dump('OUTPUT MESSAGE:', $output_message);

	// Compare against every earlier message, newest first.
	for ($j = ($i - 1); $j >= 0; --$j)
	{
		//echo "\n<br><font color='green'><b>searching for entry_message[$j] in [i = $i]:</b><br>\n$output_message</font><br>\n";
		$output_message = str_replace($entry_message[$j], '', $output_message);
		//var_dump('NEW OUTPUT MESSAGE:', $output_message);
	}
}

( ^ you have to start from the bottom up like that or else you have altered
your $output_message so subsequent matches fail ^ )

Would these be helpful? 

http://us2.php.net/manual/en/function.similar-text.php
http://us2.php.net/manual/en/function.levenshtein.php
http://us2.php.net/manual/en/function.soundex.php
http://us2.php.net/manual/en/function.metaphone.php

It seems like similar_text() could be: if it returns a high percentage,
consider it a match. But then how do I extract that part from the source
string, since str_replace() requires an exact match, not a fuzzy one?
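
One possible answer to the extraction question is to not extract a substring at all, but to score the reply one paragraph at a time and drop whole paragraphs that an earlier message already covers. The sketch below assumes paragraphs are separated by blank lines and that 90% is a sensible cut-off; both are guesses to tune against real tickets, and normalize_text() and strip_similar_paragraphs() are hypothetical names.

```php
<?php
// Hypothetical helper: remove ">" quote markers and collapse whitespace so
// re-wrapped or quoted copies of the same text compare equal.
function normalize_text($text)
{
    $text = preg_replace('/^[ \t]*(?:>[ \t]*)+/m', '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: drop paragraphs of $reply that $previous mostly covers.
// $threshold is an assumed cut-off in percent.
function strip_similar_paragraphs($previous, $reply, $threshold = 90.0)
{
    $prev = normalize_text($previous);
    $kept = array();
    foreach (preg_split('/\R{2,}/', trim($reply)) as $para) {
        $norm = normalize_text($para);
        if ($norm === '') {
            continue; // paragraph held only quote markers or whitespace
        }
        // similar_text() returns the number of matching characters; dividing
        // by the paragraph length says how much of the paragraph is covered.
        $coverage = 100.0 * similar_text($norm, $prev) / strlen($norm);
        if ($coverage < $threshold) {
            $kept[] = $para; // keep the original, un-normalized paragraph
        }
    }
    return implode("\n\n", $kept);
}
```

Because only the comparison uses normalized text, the paragraphs that survive keep their original line breaks. Two caveats: similar_text() is slow on long strings, and short paragraphs can score high by coincidence, so a minimum paragraph length may be worth adding.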

I am also thinking maybe something with preg_replace() where I break up the
source string and take the first word(s) and last word(s) and use .*? in
between, but that has its own challenges. For example...

  /For .*? website/

On this text, it doesn't produce the match I really want (it stops on the
second line)...

  For security and confidentiality reasons, we request that all
  subscribers who are requesting cancellation do so via the website of
  the company billing their account. You can easily cancel your 
  membership on our billing agent website
  More stuff goes here website

By putting more words before and after the .*? I could get better accuracy,
but that is starting to feel hacky or fragile somehow.
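
To make the "more words before and after" idea concrete, here is a rough sketch that takes the first and last few words of the earlier message as anchors. anchor_pattern() is a hypothetical name and the default of four words per anchor is an arbitrary assumption.

```php
<?php
// Hypothetical helper: build a pattern from the first and last $n words of
// $needle, with a lazy .*? in between. $n = 4 is an arbitrary assumption.
function anchor_pattern($needle, $n = 4)
{
    $words = preg_split('/\s+/', trim($needle), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) < 2 * $n) {
        return null; // too short for separate head and tail anchors
    }
    $join = function ($list) {
        $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $list);
        // Allow re-wrapping and ">" markers between the anchor words.
        return implode('[\s>]+', $quoted);
    };
    // The "s" modifier lets "." cross line breaks in re-wrapped text.
    return '/' . $join(array_slice($words, 0, $n)) . '.*?'
         . $join(array_slice($words, -$n)) . '/s';
}

$needle = 'For security and confidentiality reasons, we request that all subscribers who are requesting cancellation do so via the website of the company billing their account. You can easily cancel your membership on our billing agent website';
$text = "For security and confidentiality reasons, we request that all\n"
      . "subscribers who are requesting cancellation do so via the website of\n"
      . "the company billing their account. You can easily cancel your \n"
      . "membership on our billing agent website\n"
      . "More stuff goes here website";

preg_match('/For .*? website/s', $text, $short);   // ends at "via the website"
preg_match(anchor_pattern($needle), $text, $long); // ends at "billing agent website"
```

Longer anchors make a premature stop less likely, but the match is still only as good as the anchors: if the customer edited the first or last few words, nothing matches, which is probably the fragility I was sensing.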

-- 
PHP General Mailing List (http://www.php.net/)