Re: Real Killer App!

OK, here's something else I've just noticed about this problem. When this thing gets to a certain point, somewhere in the 4th run through, it hits a certain URL, then the next one, pauses for several seconds, and then goes back and hits that first URL again. It looks something like this:

Updating: http://www.domain.com/certainurl.html
Updating: http://www.domain.com/nexturl.html
Updating: http://www.domain.com/certainurl.html

It only does this in one place, though not necessarily the same place, depending on where I start. Afterwards, it goes through a few more pages, and that's where it dies. Another thing: if I start from scratch with a site that has 69 pages in it, it goes right through and indexes it fine without any problems. I have two other sites I'm using for testing, and they both have over 1000 pages. On both of them it dies after putting about 240-250 records in the database, usually right at 244. This is true of both sites, and it happens no matter where I start or how I configure the initial URL. I keep coming back to a memory problem with either PHP or MySQL, but could I be looking in the wrong place? Could the problem be with Apache?
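In the meantime, I'm going to try logging memory use on each pass to see whether it climbs steadily toward some limit right before record 244. Something like this inside the main loop (a rough sketch; I know memory_get_usage() is only there if PHP was compiled with --enable-memory-limit):

// Log memory per record so the last logged line shows usage
// just before the crash
$count++;
if(function_exists("memory_get_usage")){
    error_log("Record $count: " . memory_get_usage() . " bytes");
}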

I also wanted to thank you guys. I've gotten a few really good suggestions about the code, and although they haven't solved this specific problem, they have helped me improve the overall performance of the app.

Nick





Paul Burney wrote:

on 3/12/03 5:45 PM, Nicholas Fitzgerald at nick@axelis.com appended the
following bits to my mbox:



is that entire prog as it now exists. Notice I have NOT configured it yet to go into the next level. I did this on purpose so I wouldn't have to kill it in the middle of operation and potentially screw stuff up. The way it is now, it looks at all the records in the database, updates them if necessary, then extracts all the links and puts them into the database for crawling on the next run through. Once I get this working I'll put a big loop in it so it keeps going until there's nothing left to look at. Meanwhile, if anyone sees anything in here that could be the cause of this problem, please let me know!



I don't think I've found the problem, but I thought I'd point out a couple of things:



// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search");
while($rslt = mysql_fetch_array($sql)){
    $url = $rslt["url"];



The above line gets all the data from the table and then starts looping through...



// Put the stuff in the search database
$puts = mysql_query("SELECT * FROM search WHERE url='$url'");
$site = mysql_fetch_array($puts);
$nurl = $site["url"];
$ncrc = $site["checksum"];
$ndate = $site["date"];
if($ndate <= $daycheck || $ncrc != $checksum){



That line does the same query again for this particular URL to set variables in the $site array, though you already have this info in the $rslt array. You could potentially save hundreds of queries there.
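In other words, something like this (a sketch of that same block, just reusing the row you already fetched in the outer loop):

// Reuse the row from the outer loop instead of querying again
$nurl = $rslt["url"];
$ncrc = $rslt["checksum"];
$ndate = $rslt["date"];
if($ndate <= $daycheck || $ncrc != $checksum){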



// Get the page title
$temp = stristr($read,"<title>");


<snip>


$tchn = ($tend - $tpos);
$title = strip_tags(substr($read, ($tpos+7),$tchn));



Aside: Interesting way of doing things. I usually just preg_match these things, but I like this too.
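For comparison, the preg_match version would look something like this (untested sketch):

// Case-insensitive match; "s" lets the title span newlines
if(preg_match('/<title>(.*?)<\/title>/is', $read, $match)){
    $title = strip_tags(trim($match[1]));
}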




// Kill any trailing slashes
if(substr($link,(strlen($link)-1)) == "/"){
    $link = substr($link,0,(strlen($link)-1));
}



Why are you killing the trailing slashes? That's going to make fopen do double the work to get to the pages: first it will request the page without the slash, then get a redirect response with the slash, and then request the page again.
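If the point is just to normalize URLs so you don't store duplicates, it's safer to add the slash to directory-style links than to strip it. A rough sketch, assuming a URL whose last segment has no dot is a directory:

// Add a trailing slash to bare directory-style URLs so fopen
// hits the canonical form directly instead of via a redirect
$parts = parse_url($link);
$path = isset($parts["path"]) ? $parts["path"] : "/";
if(substr($path, -1) != "/" && strpos(basename($path), ".") === false){
    $link .= "/";
}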



// Put the new URL in the search database
$chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
$curec = mysql_fetch_array($chk);
if(!$curec){
    echo "Adding: $link\n";
    $putup = mysql_query("INSERT INTO search SET url='$link'");
}
else{
    continue;
}



You might want to give a different variable name to the "new link", or encapsulate the above in a function, so your $link variables don't clobber each other.
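For example (a sketch, in the same mysql_* style as the rest of the script):

// Encapsulated so this $link can't clobber the caller's $link
function add_url_if_new($link){
    $chk = mysql_query("SELECT url FROM search WHERE url = '$link'");
    if(!mysql_fetch_array($chk)){
        echo "Adding: $link\n";
        mysql_query("INSERT INTO search SET url='$link'");
    }
}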



indicate where the chokepoint might be. It seems to be when the DB
reaches a certain size, but 300 or so records should be a piece of cake
for it. As far as the debug backtrace goes, there really isn't anything
there that stands out. It's not an issue with a variable; something is
going wrong in the execution of either PHP or a SQL query. I'm not
finding any errors in the MySQL error log, or anywhere else.



What URL is it dying on? You could probably echo each $url to the terminal to watch its progression and see where it stops.
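Something like this at the top of the main loop (the flush() matters so the output isn't buffered past the crash):

// Print each URL as it's processed; flush so the last line
// shown is the URL it was working on when it died
echo "Fetching: $url\n";
flush();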

I've had problems with Apache using custom PHP error documents where the
error document contained a PHP-generated image that wasn't found. Each
failed image would generate another PHP error, which cascaded until the
server basically died.

KIND OF BROADER ASIDE REGARDING SEARCH ENGINE PROBLEMS:

I've also had recursion problems because PHP allows extra path segments to
be appended after the request. For example, let's say you have an
examples.php file, and for some reason examples.php contains a relative
link to examples/somefile.html. If the examples directory doesn't exist,
Apache will serve examples.php to the user for the request of
examples/somefile.html. A recursive search engine that isn't too smart
(e.g., Infoseek and Excite for colleges) will keep requesting things like:

http://example.com/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/somefile.html

As far as Apache is concerned, it is fulfilling the request with the
examples.php file, and PHP just sees a really long PATH_INFO starting with
/examples.
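One way to guard against that on the server side is to refuse requests that carry extra path info. A minimal sketch, at the top of examples.php (assumes PHP 4.1+ for the $_SERVER superglobal):

// Bail out if anything was appended after the script name,
// e.g. /examples.php/examples/somefile.html
if(!empty($_SERVER["PATH_INFO"])){
    header("HTTP/1.0 404 Not Found");
    exit;
}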

I'm sure that isn't your problem, but I've been bitten by it a few times.

END OF ASIDE

Hope some of that ramble helps. Please try to see if it is dying on a
particular URL so we can be of further assistance.

Sincerely,

Paul Burney
<http://paulburney.com/>

Q: Tired of creating admin interfaces to your MySQL web applications?

A: Use MySTRI instead. Version 3.1 now available.
                           <http://mystri.sourceforge.net/>






