Re: Re: Parsing HTML

On Thu, February 16, 2006 1:20 pm, Boby wrote:
> Jay Blanchard wrote:
>>> I need to extract news items from several news sites.
>  >> ...
>>> Can anybody please give me some pointers?
>>
>> Can you be more specific here? This is awfully broad.
>
> I'll give an example:
>
> Let's say I want to extract some news-items from the www.CNN.com web
> page (If you visit CNN's page, you can see the 'MORE NEWS' block at
> the right side).
>
> I know how to extract the news-items (or any other data in the page)
> using regular expressions, but I wonder if there are other ways. The
> code I'm writing will be maintained by other people in the future, and
> perhaps regular expressions won't be easy for them to update when the
> site changes its format.
>
> Can somebody please give me a short overview of the different ways of
> extracting data from HTML?
>
> I hope my question is clear enough now.

First, bypass the problem entirely, and look for an RSS / XML feed of
the same content.

Or a SOAP source of the content.

Or RPC.

Or WebDAV.

Or *anything* but web-scraping and parsing HTML, which should be your
"last resort".

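For the RSS case above, here is a rough sketch of how little code a feed
needs compared to scraping. It assumes PHP 5.1+ with the SimpleXML
extension and a plain RSS 2.0 layout (rss > channel > item). The feed
content is made up for illustration and feed_items() is a hypothetical
helper name, not anything the site provides.

```php
<?php
// Made-up RSS 2.0 sample; a real feed would be fetched from whatever
// URL the site publishes (assumption: the site offers one at all).
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: return title/link pairs from an RSS 2.0 string,
// or an empty array if the string is not well-formed XML.
function feed_items($rss) {
  libxml_use_internal_errors(true); // collect parse errors quietly
  $xml = simplexml_load_string($rss);
  if ($xml === false) {
    return array();
  }
  $items = array();
  foreach ($xml->channel->item as $item) {
    $items[] = array(
      "title" => (string) $item->title,
      "link"  => (string) $item->link,
    );
  }
  return $items;
}

foreach (feed_items($rss) as $item) {
  echo $item["title"], " -> ", $item["link"], "\n";
}
```

No markers to hunt for, and nothing breaks when the site redesigns its
pages, as long as the feed itself stays put.
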
If Regex seems high-maintenance, consider using just strpos and
explode and other simple string functions.
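A minimal sketch of the strpos approach, assuming the chunk you want
sits between two marker strings that each show up in the page.
text_between() is a made-up helper name and the sample page string is
invented for illustration.

```php
<?php
// Hypothetical helper: return the text between two marker strings,
// or null if either marker is missing.
function text_between($haystack, $start, $end) {
  $from = strpos($haystack, $start);
  if ($from === false) {
    return null; // start marker not found
  }
  $from += strlen($start); // skip past the start marker itself
  $to = strpos($haystack, $end, $from);
  if ($to === false) {
    return null; // end marker not found after the start marker
  }
  return substr($haystack, $from, $to - $from);
}

// Invented sample page, just to show the call.
$page = "<html>header<div class=\"news\">Item one</div>footer</html>";
echo text_between($page, "<div class=\"news\">", "</div>"), "\n";
```

When the site changes its format, the maintainer only has to update the
two marker strings.
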

Most of the time, what you want in a web-scrape looks like this:

<html>
TONS OF CRAP YOU DON'T CARE ABOUT,
LINE AFTER LINE,
PAGE AFTER PAGE...

<div class="cnnBulletList"><div>&#8226;&nbsp;<a
href="/2006/TECH/science/02/16/greenland.glaciers.ap/index.html">Melting
of Atlantic glaciers speeds up</a><br></div><div>&#8226;&nbsp;<a
href="/2006/WORLD/meast/02/16/iraq.main/index.html">U.S.: Iraqi death
squad members detained</a><br></div><div>&#8226;&nbsp;<a
href="/2006/US/02/16/un.guantanamo/index.html">U.N.: Close Gitmo, free
or try detainees</a> | <a
href="javascript:cnnVideo('play','/video/world/2006/02/16/oakley.un.guantanamo.latest.cnn','2006/02/23');"><img
src="http://i.a.cnn.net/cnn/.element/img/1.3/misc/icon.wd.watch.white.gif"
alt="WATCH" width="39" height="14" hspace="0" vspace="0" border="0"
class="cnnWatchBtn"></a><br></div><div>&#8226;&nbsp;<a
href="/2006/POLITICS/02/16/cheney.ap/index.html">Bush OK with Cheney's
story on shooting</a> | <a
href="javascript:CNN_openPopup('/interactive/allpolitics/0602/gallery.cheney.accident2/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Shooting
timeline</a><br></div><div>&#8226;&nbsp;<b><b>360&#176; Blog:</b> </b>
<a href="/CNN/Programs/anderson.cooper.360/blog/">Massive morgue
outside New Orleans rests in peace</a><br></div><div>&#8226;&nbsp;<a
href="/2006/POLITICS/02/16/congress.ports.ap/index.html">White House
urged to review sale of U.S. ports to Arab
firm</a><br></div><div>&#8226;&nbsp;<b><span class="cnnWOOL">Watch:
</span></b> <a
href="javascript:cnnVideo('play','/video/us/2006/02/16/helton.rebuilding.the.bayou.cnn','2006/09/01');">Volunteers
build hope, homes on the bayou</a> | <a
href="/2006/US/02/16/helton.habitat/index.html">Read</a><br></div><div>&#8226;&nbsp;<b><span
class="cnnWOOL">Watch: </span></b> <a
href="javascript:cnnVideo('play','/video/offbeat/2006/02/16/nandy.oh.funeral.home.wedding.affl','2006/02/23');">Rings?
Check. Vows? Check. Caskets? Check.</a><br></div><div>&#8226;&nbsp;<a
href="/2006/SHOWBIZ/Movies/02/16/film.bond.reut/index.html">New Bond
film finds its villain</a><br></div><!--PIPELINE BULLET

LOTS OF FOOTER CRAP YOU DON'T CARE ABOUT
</html>


<?php
// file_get_contents() returns false if the page can't be fetched
$html = file_get_contents("http://www.cnn.com");
if ($html === false) {
  die("Could not fetch the page");
}
$bullets = explode("<div class=\"cnnBulletList\">", $html);
if (count($bullets) < 2) {
  die("cnnBulletList marker not found; the page format probably changed");
}
$bullets = $bullets[1]; //0 element is header stuff we don't care about
$bullets = explode("<!--PIPELINE BULLET", $bullets);
$bullets = $bullets[0]; //1 element is footer crap we don't care about

$bullets = explode("<div>", $bullets);
unset($bullets[0]); //The empty string in front of the first <div>
foreach($bullets as $bullet){
  echo htmlentities($bullet);
}
?>

You now have each of the "More News" items separated out, and can tear
them apart into individual elements pretty much the same way.

explode on "<a" and "</a>" to get the links, etc.
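Roughly like this, as a sketch only. bullet_links() is a made-up helper
name. It assumes every href is double-quoted, that no attribute value
contains a ">", and that the bullet has no other tags starting with
"<a" (such as <abbr>), all of which hold for the sample above but may
not hold elsewhere.

```php
<?php
// Hypothetical helper: pull the href and link text out of one bullet
// using only explode(), strpos() and substr().
function bullet_links($bullet) {
  $links = array();
  $chunks = explode("<a", $bullet);
  unset($chunks[0]); // text in front of the first <a
  foreach ($chunks as $chunk) {
    // Keep only what comes before the closing tag.
    $parts = explode("</a>", $chunk);
    $anchor = $parts[0]; // e.g. ' href="/x.html">Title'
    // The first ">" ends the opening tag (assumption: no ">" in attributes).
    $gt = strpos($anchor, ">");
    if ($gt === false) {
      continue;
    }
    $attrs = substr($anchor, 0, $gt);
    $text = substr($anchor, $gt + 1);
    // The href value sits between 'href="' and the next '"'.
    $href = explode("href=\"", $attrs);
    if (count($href) < 2) {
      continue; // no double-quoted href on this tag
    }
    $href = explode("\"", $href[1]);
    $links[] = array(
      "href" => $href[0],
      "text" => trim(strip_tags($text)), // drops nested tags like <img>
    );
  }
  return $links;
}

$bullet = "&#8226;&nbsp;<a href=\"/2006/SHOWBIZ/Movies/02/16/film.bond.reut/index.html\">New Bond film finds its villain</a><br></div>";
print_r(bullet_links($bullet));
```

Links that wrap an image instead of text come back with an empty "text"
value, so skip those if you only want headlines.
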

It's not as fast or as pretty as preg, but any programmer can stumble
through it and figure it out, with no Regex skills at all.

-- 
Like Music?
http://l-i-e.com/artists.htm
