Re: Real Killer App!

Nicholas Fitzgerald <nick@axelis.com> · Wed, 12 Mar 2003 14:45:33 -0800

Ok, I thought the index bit might be the problem since when I went to 
look I realized I hadn't created one for this database. I also noticed a 
problem with my raid array so I updated the driver and that went away. 
Still, no matter what, I get this same problem.  So here you go. Below 
is that entire prog as it now exists. Notice I have NOT configured it as 
yet to into the next level. I did this on purpose so I wouldn't have to 
kill it in the middle of operation and potencially scew stuff up. They 
way it is now, it looks at all the records in the database, updates them 
if necessary, then extracts all the links and puts them into the 
database for crawling on the next run through. Once I get this working 
I'll put a big loop in it so it keeps going until there's nothing left 
to look at. Meanwhile, if anyone sees anything in here that could be the 
cause of this problem please let me know!

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);

echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search");
while($rslt = mysql_fetch_array($sql)){
   $url = $rslt["url"];

// Open URL for parsing
   $open = @fopen("$url", "r");
   if($open){
       $read = fread($open, 100000);
       fclose($open);
   }
   else{
       $kill = mysql_query("DELETE FROM search WHERE url='$url'");
       continue;
   }

// Set date and checksum info

   $today = date("Y-m-d");

   $checksum = crc32($read);

   $chkyr = strftime(date("Y"));

   $chkmo = strftime(date("m"));

   $chkdy = strftime(date("d"));

   $daycheck = strftime("%Y-%m-%d", 
mktime(0,0,0,$chkmo,-$spiderday,$chkyr));

// Get meta tags and use get_meta_tags to check if the file is actually 
there

   $meta = @get_meta_tags($url);

   if(!$meta){

       $kill = mysql_query("DELETE FROM search WHERE url='$url'");

       continue;

   }

   $robots = $meta["robots"];

   $keywords = $meta["keywords"];

   $description = $meta["description"];

// Get the page title
   $temp = stristr($read,"<title>");
   $tpos = strlen($read) - strlen($temp);
   $temp = stristr($read,"</title>");
   $tend = strlen($read) - strlen($temp);
   $tchn = ($tend - $tpos);
   $title = strip_tags(substr($read, ($tpos+7),$tchn));

// Get the page body
   $body = str_replace("'","`",trim(strip_tags($read)));

// Put the stuff in the search database

   $puts = mysql_query("SELECT * FROM search WHERE url='$url'");

   $site = mysql_fetch_array($puts);

   $nurl = $site["url"];

   $ncrc = $site["checksum"];

   $ndate = $site["date"];

   if($ndate <= $daycheck || $ncrc != $checksum){

       echo "\n\nUpdating: $title\n$url\n";

       $renew = mysql_query("UPDATE search SET url='$url', 
title='$title', metak='$keywords', metad='$description', 
mrobot='$robots', body='$body', checksum='$checksum', date=CURDATE() 
WHERE url='$url'");

   }

   else{

       continue;

   }

// Parse the main URL
   $top = parse_url($url);
   $tschm = $top["scheme"];
   $thost = $top["host"];
   $tpath = $top["path"];
   $tqury = $top["query"];
   $tfrag = $top["fragment"];

// Parse all the links on the page

   $rtemp = stristr($read,"href");   
   $temp = stristr($rtemp,">");

   while($rtemp){

   // Parse the href out of the string

       $rtemp = stristr($temp,"href");   
       $lpos = strlen($rtemp) - strlen($temp);

       $temp = stristr($rtemp,">");

       $lend = strlen($rtemp) - strlen($temp);

       $alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6, 
($lend)))));

       $blink = stristr($alink," ");

       $alen = strlen($alink) - strlen($blink);

       $link = substr($alink, 0, $alen);

   // Kill any trailing slashes
       if(substr($link,(strlen($link)-1)) == "/"){
           $link = substr($link,0,(strlen($link)-1));
       }

   // Get rid of any garbage and most binary files in the link
       if(substr_count($link,"&?") != 0){
           continue;
       }

       if(substr_count($link,"@") != 0){
           continue;
       }

       if(substr_count($link,"javascript") != 0){
           continue;
       }

       if(substr_count($link,"mailto") != 0){

           continue;

       }

       if(substr_count($link,"jpg") != 0){

           continue;

       }

       if(substr_count($link,"gif") != 0){

           continue;

       }

       if(substr_count($link,"pdf") != 0){
           continue;
       }

       if(substr_count($link,"pnf") != 0){
           continue;
       }

       if(substr_count($link,"mpg") != 0){
           continue;
       }

       if(substr_count($link,"mpeg") != 0){
           continue;
       }

       if(substr_count($link,"av") != 0){
           continue;
       }

   // Parse the current link
       $bot = @parse_url($link);
       if(!$bot){
           continue;
       }
       $bschm = $bot["scheme"];
       $bhost = $bot["host"];
       $bpath = $bot["path"];
       $bqury = $bot["query"];
       $bfrag = $bot["fragment"];

   // Get rid of outside links
       if($bhost != "" && $bhost != $thost){
           continue;
       }

   // Kill off any dot dots "../../"
       $ddotcheck = substr_count($bpath,"../");
       if($ddotcheck != ""){
           $lpos = strrpos($bpath,"..");
           $bpath = substr($bpath,$lpos);
       }

   // Comparitive analisys
       if($bpath != "" && substr($bpath,0,1) != "/"){
           if(strrpos($tpath,".") === false){
               $bpath = $tpath . "/" . $bpath;
           }
           if(strrpos($tpath,".")){
               $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
               $bpath = $ttmp . $bpath;
               if(substr($bpath,0,1) != "/"){
                   $bpath = "/" . $bpath;
               }
           }
       }

       if($bhost == ""){
           $link = $tschm . "://" . $thost . $bpath;
       }

   // Kill any trailing slashes
       if(substr($link,(strlen($link)-1)) == "/"){
           $link = substr($link,0,(strlen($link)-1));
       }

   // If there is a query string put it back on
       if($bqury != ""){
           $link = $link . "?" . $bqury;
       }

       if($link == ""){
           continue;
       }

   // Put the new URL in the search database

       $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");

       $curec = mysql_fetch_array($chk);

       if(!$curec){

           echo "Adding: $link\n";

           $putup = mysql_query("INSERT INTO search SET url='$link'");

       }

       else{

           continue;

       }

   }

}

echo "\n\n##### The Spider is Finished, You Can Now Close This Console 
#####\n";

?>

Jim Hunter wrote:

Just a guess, but do you have an index on the table that you are using to

store the URLs that still need to be parsed? This table is going to get

huge! And if you do not delete the URL that you just parsed from the list it

will grow even faster. And if you do not have an index on that table and you

are doing a table scan to see if the new URL is in it or not, this is going

to take longer and longer to complete every time you process another URL.

This is because this temp table of URLs to process will always get larger,

and will rarely go down in size because you add about 5+ new URLs for every

one that you process. 

But then again, we don't know for sure on anything without seeing 'some'

code. So far we have not seen any so everything is total speculation and

guessing. I would be interested in seeing the code that handles the

processing of the URLs once you cull them from a web page. 

Jim Hunter

-------Original Message-------

From: Nicholas Fitzgerald

Date: Wednesday, March 12, 2003 10:15:52 AM

To: php-db@lists.php.net

Subject: Re:  Real Killer App!

Rich Gray wrote:

I'm having a heck of a time trying to write a little web crawler for my

intranet. I've got everything functionally working it seems like, but

there is a very strange problem I can't nail down. If I put in an entry

and start the crawler it goes great through the first loop. It gets the

url, gets the page info, puts it in the database, and then parses all of

the links out and puts them raw into the database. On the second loop it

picks up all the new stuff and does the same thing. By the time the

second loop is completed I'll have just over 300 items in the database.

On the third loop is where the problem starts. Once it gets into the

third loop, it starts to slow down a lot. Then, after a while, if I'm

running from the command line, it'll just go to a command prompt. If I'm

running in a browser, it returns a "document contains no data" error.

This is with php 4.3.1 on a win2000 server. Haven't tried it on a linux

box yet, but I'd rather run it on the windows server since it's bigger

and has plenty of cpu, memory, and raid space. It's almost like the

thing is getting confused when it starts to get more than 300 entries in

the database. Any ideas out there as to what would cause this kind of

problem?

Nick

Can you post some code? Are your script timeouts set appropriately? Does

memory/CPU useage increase dramatically or are there any other symptoms of

where it is choking...? What DB is it updating? What does the database tell

you is happening when it starts choking? What do debug messages tell you

wrt

finding the bottleneck? Does it happen always no matter what start point is

used? Are you using recursive functions?

Sorry lots of questions but no answers... :)

Cheers

Rich

Recognizing that this script would take a long time to run I'm using 

set_time_limit(0) in it so a timeout doesn't become an issue. The server 

has 1.5 gig of memory and is a dual processor 1GHz PIII. I have never 

seen it get over 15% cpu usage, even while this is going on, and it 

never gets anywhere near full memory usage. The tax on the system itself 

is actually negligable. There are no symptoms that I can find to 

indicate where the chokepoint might be. It seems to be when the DB 

reaches a certain size, but 300 or so records should be a piece of cake 

for it. As far as the debug backtrace, there really isn't anything there 

that stands out. It's not an issue with a variable, something is going 

wrong in the execution either of php, or a sql query. I'm not finding 

any errors in the mysql error log, or anywhere else.

Basically the prog is in two parts. First, it goes and gets the current 

contents of the DB, one record at a time, and checks it. If it meets the 

criteria it is then indexed or reindexed. If it is indexed, then it goes 

to the second part. This is where it strips any links from the page and 

puts them in the DB for indexing, if thery're not already there. When it 

dies, this is where it dies. I'll get the "UPDATING: <title><url> 

message that comes up when it does an update, but at that point, where 

it is going into strip links, it dies right there.

Nick