Re: Real Killer App!

OK, here's something else I've just noticed about this problem. When this thing gets to a certain point, somewhere in the 4th run through, it hits a certain URL, then the next one, pauses for several seconds, and then goes back and hits that first URL again. It looks something like this:

Updating: http://www.domain.com/certainurl.html
Updating: http://www.domain.com/nexturl.html
Updating: http://www.domain.com/certainurl.html

It only does this in one place, though not necessarily the same place, depending on where I start. Afterwards, it goes through a few more pages, and that's where it dies. Another thing: if I start from scratch with a site that has 69 pages in it, it goes right through and indexes it fine without any problems. I have two other sites I'm using for testing, and they both have over 1000 pages. On both of them it dies after putting about 240-250 records in the database, usually right at 244. This is true of both sites, and it happens no matter where I start or how I configure the initial URL. I keep coming back to a memory problem with either PHP or MySQL, but could I be looking in the wrong place? Could the problem be with Apache?
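In the meantime, I'm going to try logging memory use on each pass to see whether it climbs steadily toward some limit right before record 244. Something like this inside the main loop (a rough sketch; I know memory_get_usage() is only there if PHP was compiled with --enable-memory-limit):

// Log memory per record so the last logged line shows usage
// just before the crash
$count++;
if(function_exists("memory_get_usage")){
    error_log("Record $count: " . memory_get_usage() . " bytes");
}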

I also wanted to thank you guys. I've gotten a few really good suggestions about the code, and although they haven't solved this specific problem, they have helped me improve the overall performance of the app.

Nick





Paul Burney wrote:

on 3/12/03 5:45 PM, Nicholas Fitzgerald at nick@axelis.com appended the
following bits to my mbox:



is that entire prog as it now exists. Notice I have NOT configured it yet to go into the next level. I did this on purpose so I wouldn't have to kill it in the middle of operation and potentially screw stuff up. The way it is now, it looks at all the records in the database, updates them if necessary, then extracts all the links and puts them into the database for crawling on the next run through. Once I get this working I'll put a big loop in it so it keeps going until there's nothing left to look at. Meanwhile, if anyone sees anything in here that could be the cause of this problem, please let me know!



I don't think I've found the problem, but I thought I'd point out a couple of things:



// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search");
while($rslt = mysql_fetch_array($sql)){
    $url = $rslt["url"];



The above line gets all the data from the table and then starts looping through...



// Put the stuff in the search database
$puts = mysql_query("SELECT * FROM search WHERE url='$url'");
$site = mysql_fetch_array($puts);
$nurl = $site["url"];
$ncrc = $site["checksum"];
$ndate = $site["date"];
if($ndate <= $daycheck || $ncrc != $checksum){



That line does the same query again for this particular URL to set variables in the $site array, though you already have this info in the $rslt array. You could potentially save hundreds of queries there.
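In other words, something like this (a sketch of that same block, just reusing the row you already fetched in the outer loop):

// Reuse the row from the outer loop instead of querying again
$nurl = $rslt["url"];
$ncrc = $rslt["checksum"];
$ndate = $rslt["date"];
if($ndate <= $daycheck || $ncrc != $checksum){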



// Get the page title
$temp = stristr($read,"<title>");


<snip>


$tchn = ($tend - $tpos);
$title = strip_tags(substr($read, ($tpos+7),$tchn));



Aside: Interesting way of doing things. I usually just preg_match these things, but I like this too.
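For comparison, the preg_match version would look something like this (untested sketch):

// Case-insensitive match; "s" lets the title span newlines
if(preg_match('/<title>(.*?)<\/title>/is', $read, $match)){
    $title = strip_tags(trim($match[1]));
}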




// Kill any trailing slashes
if(substr($link,(strlen($link)-1)) == "/"){
    $link = substr($link,0,(strlen($link)-1));
}



Why are you killing the trailing slashes? That's going to make fopen do double the work to get to the pages: first it will request the page without the slash, then get a redirect response with the slash, and then request the page again.
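If the point is just to normalize URLs so you don't store duplicates, it's safer to add the slash to directory-style links than to strip it. A rough sketch, assuming a URL whose last segment has no dot is a directory:

// Add a trailing slash to bare directory-style URLs so fopen
// hits the canonical form directly instead of via a redirect
$parts = parse_url($link);
$path = isset($parts["path"]) ? $parts["path"] : "/";
if(substr($path, -1) != "/" && strpos(basename($path), ".") === false){
    $link .= "/";
}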



// Put the new URL in the search database
$chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
$curec = mysql_fetch_array($chk);
if(!$curec){
    echo "Adding: $link\n";
    $putup = mysql_query("INSERT INTO search SET url='$link'");
}
else{
    continue;
}



You might want to give a different variable name to the "new link", or encapsulate the above in a function, so your $link variables don't clobber each other.
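For example (a sketch, in the same mysql_* style as the rest of the script):

// Encapsulated so this $link can't clobber the caller's $link
function add_url_if_new($link){
    $chk = mysql_query("SELECT url FROM search WHERE url = '$link'");
    if(!mysql_fetch_array($chk)){
        echo "Adding: $link\n";
        mysql_query("INSERT INTO search SET url='$link'");
    }
}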



indicate where the chokepoint might be. It seems to be when the DB
reaches a certain size, but 300 or so records should be a piece of cake
for it. As far as the debug backtrace goes, there really isn't anything
there that stands out. It's not an issue with a variable; something is
going wrong in the execution of either PHP or a SQL query. I'm not
finding any errors in the MySQL error log, or anywhere else.



What URL is it dying on? You could probably echo each $url to the terminal to watch its progression and see where it stops.
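Something like this at the top of the main loop (the flush() matters so the output isn't buffered past the crash):

// Print each URL as it's processed; flush so the last line
// shown is the URL it was working on when it died
echo "Fetching: $url\n";
flush();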

I've had problems with Apache using custom PHP error documents where the
error document contained a PHP-generated image that wasn't found. Each
failed image would generate another PHP error, which cascaded until the
server basically died.

KIND OF BROADER ASIDE REGARDING SEARCH ENGINE PROBLEMS:

I've also had recursion problems because PHP allows extra path segments to
be appended after the request. For example, let's say you have an
examples.php file, and for some reason examples.php contains a relative
link to examples/somefile.html. If the examples directory doesn't exist,
Apache will serve examples.php to the user for the request of
examples/somefile.html. A recursive search engine that isn't too smart
(e.g., Infoseek and Excite for colleges) will keep requesting things like:

http://example.com/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/examples/somefile.html

As far as Apache is concerned, it is fulfilling the request with the
examples.php file, and PHP just sees a really long PATH_INFO starting with
/examples.
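One way to guard against that on the server side is to refuse requests that carry extra path info. A minimal sketch, at the top of examples.php (assumes PHP 4.1+ for the $_SERVER superglobal):

// Bail out if anything was appended after the script name,
// e.g. /examples.php/examples/somefile.html
if(!empty($_SERVER["PATH_INFO"])){
    header("HTTP/1.0 404 Not Found");
    exit;
}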

I'm sure that isn't your problem, but I've been bitten by it a few times.

END OF ASIDE

Hope some of that ramble helps. Please try to see if it is dying on a
particular URL so we can be of further assistance.

Sincerely,

Paul Burney
<http://paulburney.com/>

Q: Tired of creating admin interfaces to your MySQL web applications?

A: Use MySTRI instead. Version 3.1 now available.
                           <http://mystri.sourceforge.net/>






