On Thu, February 16, 2006 1:20 pm, Boby wrote:
> Jay Blanchard wrote:
>>> I need to extract news items from several news sites.
>
>> ...
>>> Can anybody please give me some pointers?
>>
>> Can you be more specific here? This is awfully broad.
>
> I'll give an example:
>
> Let's say I want to extract some news-items from the www.CNN.com web
> page (If you visit CNN's page, you can see the 'MORE NEWS' block at
> the right side).
>
> I know how to extract the news-items (or any other data in the page)
> using regular expressions, but I wonder if there are other ways. The
> code I'm writing will be maintained by other people in the future, and
> perhaps regular expressions won't be easy for them to update when the
> site changes its format.
>
> Can somebody please give me a short overview of the different ways of
> extracting data from HTML?
>
> I hope my question is clear enough now.

First, bypass the problem entirely, and look for an RSS / XML feed of the same content. Or a SOAP source of the content. Or RPC. Or WebDAV. Or *anything* but web-scraping and parsing HTML, which should be your "last resort".

If Regex seems high-maintenance, consider using just strpos and explode and other simple string functions.

Most of the time, what you want in a web-scrape looks like this:

<html>
TONS OF CRAP YOU DON'T CARE ABOUT, LINE AFTER LINE, PAGE AFTER PAGE...
<div class="cnnBulletList"><div>• <a href="/2006/TECH/science/02/16/greenland.glaciers.ap/index.html">Melting of Atlantic glaciers speeds up</a><br></div><div>• <a href="/2006/WORLD/meast/02/16/iraq.main/index.html">U.S.: Iraqi death squad members detained</a><br></div><div>• <a href="/2006/US/02/16/un.guantanamo/index.html">U.N.: Close Gitmo, free or try detainees</a> | <a href="javascript:cnnVideo('play','/video/world/2006/02/16/oakley.un.guantanamo.latest.cnn','2006/02/23');"><img src="http://i.a.cnn.net/cnn/.element/img/1.3/misc/icon.wd.watch.white.gif" alt="WATCH" width="39" height="14" hspace="0" vspace="0" border="0" class="cnnWatchBtn"></a><br></div><div>• <a href="/2006/POLITICS/02/16/cheney.ap/index.html">Bush OK with Cheney's story on shooting</a> | <a href="javascript:CNN_openPopup('/interactive/allpolitics/0602/gallery.cheney.accident2/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Shooting timeline</a><br></div><div>• <b><b>360° Blog:</b> </b> <a href="/CNN/Programs/anderson.cooper.360/blog/">Massive morgue outside New Orleans rests in peace</a><br></div><div>• <a href="/2006/POLITICS/02/16/congress.ports.ap/index.html">White House urged to review sale of U.S. ports to Arab firm</a><br></div><div>• <b><span class="cnnWOOL">Watch: </span></b> <a href="javascript:cnnVideo('play','/video/us/2006/02/16/helton.rebuilding.the.bayou.cnn','2006/09/01');">Volunteers build hope, homes on the bayou</a> | <a href="/2006/US/02/16/helton.habitat/index.html">Read</a><br></div><div>• <b><span class="cnnWOOL">Watch: </span></b> <a href="javascript:cnnVideo('play','/video/offbeat/2006/02/16/nandy.oh.funeral.home.wedding.affl','2006/02/23');">Rings? Check. Vows? Check. Caskets? 
Check.</a><br></div><div>• <a href="/2006/SHOWBIZ/Movies/02/16/film.bond.reut/index.html">New Bond film finds its villain</a><br></div><!--PIPELINE BULLET
LOTS OF FOOTER CRAP YOU DON'T CARE ABOUT
</html>

<?php
  $html = file_get_contents("http://www.cnn.com");
  if ($html === false) die("Could not fetch the page");

  // Cut off everything before the bullet list
  $bullets = explode("<div class=\"cnnBulletList\">", $html);
  if (!isset($bullets[1])) die("Bullet list not found; the page format may have changed");
  $bullets = $bullets[1]; // element 0 is header stuff we don't care about

  // Cut off everything after the bullet list
  $bullets = explode("<!--PIPELINE BULLET", $bullets);
  $bullets = $bullets[0]; // element 1 is footer crap we don't care about

  // One array element per news item
  $bullets = explode("<div>", $bullets);
  unset($bullets[0]); // empty string before the first <div>

  foreach ($bullets as $bullet) {
    echo htmlentities($bullet), "<br>\n";
  }
?>

You now have each of the "More News" items separated out, and can tear them apart into individual elements in much the same way: explode on "<a" and "</a>" to get the links, and so on.

It's not as fast or as pretty as preg, but any programmer can stumble through it and figure it out, with no Regex skills at all.

-- 
Like Music?
http://l-i-e.com/artists.htm
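
As a rough sketch of that last step (tearing one bullet apart into its links), something like the following might do. The function name extract_links and the shape of the returned array are made up for illustration and are not part of the original code. It assumes every href is wrapped in double quotes and that no attribute value contains a ">" character, which holds for the sample markup above but is not guaranteed for other pages.

```php
<?php
// Hypothetical helper (illustration only): pull href/text pairs out of
// one bullet using nothing but explode, strpos and substr.
function extract_links($bullet)
{
    $links = array();

    $chunks = explode("<a ", $bullet);
    unset($chunks[0]); // whatever came before the first <a tag

    foreach ($chunks as $chunk) {
        // Keep only what sits inside this <a ...>...</a> pair
        $parts = explode("</a>", $chunk);
        $anchor = $parts[0];

        // The first ">" ends the opening tag: attributes before, text after
        $gt = strpos($anchor, ">");
        if ($gt === false) {
            continue; // malformed tag, skip it
        }
        $attrs = substr($anchor, 0, $gt);
        $text  = substr($anchor, $gt + 1);

        // The URL sits between href=" and the next double quote
        $start = strpos($attrs, 'href="');
        if ($start === false) {
            continue; // no href attribute on this tag
        }
        $start += strlen('href="');
        $end = strpos($attrs, '"', $start);
        if ($end === false) {
            continue; // unterminated attribute value
        }

        $links[] = array(
            'href' => substr($attrs, $start, $end - $start),
            'text' => strip_tags($text), // drops nested tags such as <img>
        );
    }

    return $links;
}

// Example with one bullet from the sample markup above
$bullet = '• <a href="/2006/SHOWBIZ/Movies/02/16/film.bond.reut/index.html">New Bond film finds its villain</a><br></div>';
foreach (extract_links($bullet) as $link) {
    echo $link['href'], " => ", $link['text'], "\n";
}
```

Bullets with two links (the "Watch" / "Read" pairs) come back as two entries, and the javascript: video links come back with the javascript: call as their href, so the caller would probably want to skip any href that starts with "javascript:".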