Re: html parser tutorial

"Richard Lynch" <ceo@xxxxxxxxx> · Tue, 7 Dec 2004 10:10:42 -0800 (PST)

Ahmed Abdel-Aliem wrote:
> Does anyone know of a good tutorial for parsing HTML files?
> I have an HTML page and I want to parse information from it to insert
> into MySQL.
> I have good experience in PHP, but I haven't written a parser before.
> Can anyone help, please?

HTML Tidy (the PHP tidy extension) is supposed to be good at that.  I've
never actually tried it, but John Coggeshall's presentation a few months
ago at the Chicago PHP User Group meeting was pretty compelling.

If you only need a few small bits of information from web pages whose
format doesn't change often, you can probably get it done quickly and
easily with http://php.net/explode.

I've scraped a lot of stuff that way myself.

You simply have to search the HTML for a distinctive tag that is unlikely
to change often and is shortly before the content you want.

Then use http://php.net/explode with that tag.  For example, on a site
with calendar events, you might use:

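<?php
  //Fetch the page as a single string (needs allow_url_fopen enabled)
  $html = file_get_contents('http://example.com/');
  if ($html === false) {
    die('Could not fetch the page');
  }
  $parts = explode('<td class="event_date"', $html);
  //The first chunk is whatever came before the first event, so drop it
  array_shift($parts);
  foreach ($parts as $event) {
    //Pad to 3 cells so a short row doesn't raise undefined-offset notices
    list($date, $speaker, $description) = array_pad(explode('</td>', $event), 3, '');
    //Prepend <td because we stripped it off in 'explode' above
    $date = strip_tags("<td $date");
    $speaker = strip_tags($speaker);
    $description = strip_tags($description);
    //Double-check the data as a valid date,
    //maybe even speaker/description as non-empty
    //and either log error or insert to your database
  }
?>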
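
To make the "double-check the data" step concrete, here is a rough sketch
that runs the same explode trick against a hard-coded sample string, so
you can try it without hitting a live site.  The sample markup, the
function name scrape_events(), and the YYYY-MM-DD date format are all
assumptions for illustration; adjust them to whatever your target page
actually uses.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<table>'
      . '<tr><td class="event_date">2004-12-14</td><td>Jane Doe</td>'
      . '<td>Intro to <b>scraping</b></td></tr>'
      . '<tr><td class="event_date">not a date</td><td>John Roe</td>'
      . '<td>Broken row</td></tr>'
      . '</table>';

// scrape_events() is an illustrative name, not a PHP built-in.
// Returns array($events, $errors).
function scrape_events($html)
{
    $events = array();
    $errors = array();
    $parts = explode('<td class="event_date"', $html);
    array_shift($parts); // drop everything before the first marker
    foreach ($parts as $chunk) {
        // Pad to three cells so a short row cannot raise a notice.
        list($date, $speaker, $description) =
            array_pad(explode('</td>', $chunk, 4), 3, '');
        // Re-add the tag opener so strip_tags() also eats the leftover '>'.
        $date = trim(strip_tags('<td' . $date));
        $speaker = trim(strip_tags($speaker));
        $description = trim(strip_tags($description));
        // Assumed format: YYYY-MM-DD. checkdate() takes month, day, year.
        $ok = preg_match('/^(\d{4})-(\d{2})-(\d{2})$/', $date, $m)
            && checkdate((int) $m[2], (int) $m[3], (int) $m[1]);
        if ($ok && $speaker !== '' && $description !== '') {
            $events[] = array(
                'date' => $date,
                'speaker' => $speaker,
                'description' => $description,
            );
        } else {
            $errors[] = "Skipped row with date '$date'";
        }
    }
    return array($events, $errors);
}

list($events, $errors) = scrape_events($html);
print_r($events);
print_r($errors);
?>
```

Rows that pass the checks are the ones you would insert into your
database; the rest go to your error log so you notice when the site
changes its layout.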

MOST sites with content you want to scrape on a routine basis are pretty
predictable.  CSS classes can be particularly useful for finding the
bits you want to scrape.
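
As a sketch of using a CSS class as the landmark, a small helper along
these lines can pull the text of the first table cell carrying a given
class.  The name cell_text_by_class() and the sample markup are made up
for this example, and it assumes the class attribute is written exactly
as class="name" inside a <td>.

```php
<?php
// cell_text_by_class() is a hypothetical helper name, not a PHP built-in.
function cell_text_by_class($html, $class)
{
    // Split once on the class attribute; [1] is everything after it.
    $parts = explode('class="' . $class . '"', $html, 2);
    if (count($parts) < 2) {
        return null; // class not found: the page layout probably changed
    }
    // Keep only the text up to the closing tag of that cell.
    $cell = explode('</td>', $parts[1], 2);
    // Re-add a tag opener so strip_tags() removes the leftover '>' too.
    return trim(strip_tags('<td' . $cell[0]));
}

// Hypothetical sample markup.
$sample = '<tr><td class="venue">Room <i>101</i></td>'
        . '<td class="price">$5</td></tr>';

echo cell_text_by_class($sample, 'venue'), "\n";
echo cell_text_by_class($sample, 'price'), "\n";
?>
```

Returning null when the class is missing gives you a cheap way to detect
that the site has been redesigned, rather than silently inserting junk.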

Occasionally I run across one where it's hand-edited and completely
unpredictable -- and usually not worth scraping, in my experience.

