Re: regex help

Jochem Maas <jochem@xxxxxxxxxxxxx> · Fri, 14 Jan 2005 00:51:46 +0100

Jason Morehouse wrote:
Hello,

I normally can take a bit of regex fun, but not this time.

Simple enough, in theory... I need to match (count) all of the bold tags 
in a string, including ones with embedded styles (or whatever else can 
go in there), e.g. <b> and <b style="color:red">. My attempts keep 
matching <br> as well.

Okay, you didn't show the regexp you currently have, but no worries. I 
happen to have struck the same problem about 9 months ago when I had to 
screen-scrape product info from a static site for import into a DB.

Here's a list of regexps which will hopefully give you enough info to 
do what you want (the fifth regexp is the one you should look at most 
closely):

// strip out top and bottom

$str = preg_replace('/<[\/]?html>/is','',$str);

// strip out body tags

$str = preg_replace('/<[\/]?body[^>]*>/is','',$str);

// strip out head

$str = preg_replace('/<head>.*<[\/]head>/Uis','',$str);

// strip out non product images

$str = preg_replace('/<img[^>]*(nieuw|new|euro)\.gif[^>]*\/?>/Uis','',$str);

// strip out font, div, span, p, b

// \b stops at the end of the tag name, so <br>, <pre> etc. are left alone

$str = preg_replace('/<[\/]?(font|div|span|p|b)\b[^>]*>/Uis','',$str);

// table, td, tr attributes

$str = preg_replace('/<(table|td|tr)[^>]*>/Uis','<$1>',$str);

// strip out the first table and hr?

$str = preg_replace('/<table>.*<hr>/Uis','',$str, 1);

// strip table, td, tr

$str = preg_replace('/<[\/]?(table|td|tr|h5)>/Ui','',$str);

// strip out all new lines

$str = str_replace("\n", '', $str);

// strip out tabs

$str = preg_replace('/[\011]+/', ' ', $str);

// strip out extra white space

$str = preg_replace('/[  ]+/', ' ', $str);
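
For the counting part of the question, the same tag-matching idea can be turned around with preg_match_all(), which returns the number of full matches. What follows is only a minimal sketch, assuming PHP 7 or later; count_bold_tags() is a made-up helper name, not something from the scraper above. The word boundary (\b) after the "b" is what keeps <br> and <body> out of the count.

```php
<?php
// Hypothetical helper: counts opening <b> tags, with or without
// attributes, while ignoring <br>, <body> and closing </b> tags.
function count_bold_tags(string $html): int
{
    // \b demands a word boundary right after "b", so "<br>" and "<body>"
    // fail to match; [^>]* then takes any attributes up to the closing ">".
    $count = preg_match_all('/<b\b[^>]*>/i', $html);

    // preg_match_all() returns false on error; treat that as zero matches.
    return $count === false ? 0 : $count;
}

$sample = '<b>one</b><br><B style="color:red">two</b><body><br />';
echo count_bold_tags($sample), "\n"; // prints 2
```

Only opening tags are counted here. If closing tags should be counted too, put an optional slash in front of the name, as in '/<\/?b\b[^>]*>/i'.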

Thanks!
