Re: Real Killer App!

on 3/12/03 5:45 PM, Nicholas Fitzgerald at nick@axelis.com appended the
following bits to my mbox:

> is that entire prog as it now exists. Notice I have NOT configured it as
> yet to into the next level. I did this on purpose so I wouldn't have to
> kill it in the middle of operation and potencially scew stuff up. They
> way it is now, it looks at all the records in the database, updates them
> if necessary, then extracts all the links and puts them into the
> database for crawling on the next run through. Once I get this working
> I'll put a big loop in it so it keeps going until there's nothing left
> to look at. Meanwhile, if anyone sees anything in here that could be the
> cause of this problem please let me know!

I don't think I've found the problem, but I thought I'd point out a couple
of things:

> // Open the database and start looking at URLs
> $sql = mysql_query("SELECT * FROM search");
> while($rslt = mysql_fetch_array($sql)){
>   $url = $rslt["url"];

The above line gets all the data from the table and then starts looping
through...

> // Put the stuff in the search database
>   $puts = mysql_query("SELECT * FROM search WHERE url='$url'");
>   $site = mysql_fetch_array($puts);
>   $nurl = $site["url"];
>   $ncrc = $site["checksum"];
>   $ndate = $site["date"];
>   if($ndate <= $daycheck || $ncrc != $checksum){

That line does the same query again for this particular URL to set variables
in the $site array, though you already have this info in the $rslt array.
You could potentially save hundreds of queries there.
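
A rough sketch of that idea, assuming the row from the outer SELECT already
carries the url, checksum and date columns.  The helper name needs_update()
and the sample values are made up for illustration, and plain arrays stand
in for what mysql_fetch_array() would return:

```php
<?php
// Hypothetical helper: decide from the row the outer SELECT already
// fetched, instead of running a second SELECT for the same URL.
function needs_update(array $row, $daycheck, $checksum)
{
    return $row["date"] <= $daycheck || $row["checksum"] != $checksum;
}

// Sample rows standing in for mysql_fetch_array() results (made-up data).
$rows = array(
    array("url" => "http://example.com/",  "checksum" => "abc", "date" => "2003-03-01"),
    array("url" => "http://example.com/a", "checksum" => "def", "date" => "2003-03-12"),
);

$daycheck = "2003-03-05";   // assumed cutoff date, same format as the column
foreach ($rows as $rslt) {
    $checksum = "def";      // assumed: checksum of the freshly fetched page
    if (needs_update($rslt, $daycheck, $checksum)) {
        echo "Updating: " . $rslt["url"] . "\n";
    }
}
```

With one query per row gone, the number of SELECTs no longer grows with the
size of the table.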

> // Get the page title
>   $temp = stristr($read,"<title>");
<snip>
>   $tchn = ($tend - $tpos);
>   $title = strip_tags(substr($read, ($tpos+7),$tchn));

Aside: Interesting way of doing things.  I usually just preg_match these
things, but I like this too.
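
For comparison, a minimal sketch of the preg_match approach.  The function
name extract_title() and the sample page are invented here, and the pattern
is deliberately simple, so treat it as a starting point rather than a full
HTML parser:

```php
<?php
// Hypothetical helper: pull the <title> text out of a page with preg_match().
// Returns null when no title element is found.
function extract_title($html)
{
    // i = case-insensitive, s = let "." match newlines inside the title
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(strip_tags($m[1]));
    }
    return null;
}

// Made-up page content standing in for $read.
$read = "<html><head><TITLE>\n Real Killer App </TITLE></head></html>";
echo extract_title($read), "\n";
```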


>   // Kill any trailing slashes
>       if(substr($link,(strlen($link)-1)) == "/"){
>           $link = substr($link,0,(strlen($link)-1));
>       }

Why are you killing the trailing slashes?  For directory URLs, that makes
fopen do double the work to get to the pages.  That is, first it requests
the page without the slash, then gets a redirect response pointing to the
URL with the slash, and then requests the page again.

>   // Put the new URL in the search database
>       $chk = mysql_query("SELECT * FROM search WHERE url = '$link'");
>       $curec = mysql_fetch_array($chk);
>       if(!$curec){
>           echo "Adding: $link\n";
>           $putup = mysql_query("INSERT INTO search SET url='$link'");
>       }
>       else{
>           continue;
>       }

You might want to give a different variable name to the "new link", or
encapsulate the above in a function, so your $link variables don't clobber
each other.
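
Something along these lines is what I mean by a function.  This is only a
sketch: add_new_link() is a made-up name, and the $known array stands in
for the search table so the example runs without a database.  The real
version would do the SELECT/INSERT inside the function instead:

```php
<?php
// Hypothetical helper: record a newly discovered link unless it is already
// known. $known stands in for the search table (an assumption for this
// sketch); the real script would run its SELECT/INSERT here instead.
function add_new_link(array &$known, $newLink)
{
    if (isset($known[$newLink])) {
        return false;            // already queued, nothing to do
    }
    $known[$newLink] = true;
    echo "Adding: $newLink\n";
    return true;
}

$known = array("http://example.com/" => true);
$link  = "http://example.com/";          // the page being crawled
foreach (array("http://example.com/a", "http://example.com/") as $found) {
    add_new_link($known, $found);        // $link is never overwritten
}
```

Because the new link lives in the function's own parameter, the outer $link
keeps its value for the rest of the loop.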

>> indicate where the chokepoint might be. It seems to be when the DB
>> reaches a certain size, but 300 or so records should be a piece of cake
>> for it. As far as the debug backtrace, there really isn't anything there
>> that stands out. It's not an issue with a variable, something is going
>> wrong in the execution either of php, or a sql query. I'm not finding
>> any errors in the mysql error log, or anywhere else.

What url is it dying on?  You could probably echo each $url to the terminal
to watch its progression and see where it is stopping.
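
A minimal sketch of that kind of progress output, with a made-up helper
name (progress_line) and made-up URLs.  The flush() call is there so the
line shows up before a slow fetch starts:

```php
<?php
// Hypothetical helper: build a numbered progress line so the last line on
// the terminal names the URL the script stopped on.
function progress_line($n, $url)
{
    return sprintf("[%d] %s\n", $n, $url);
}

$urls = array("http://example.com/", "http://example.com/a"); // made-up URLs
foreach ($urls as $i => $url) {
    echo progress_line($i + 1, $url);
    flush();   // push the line out before the slow fetch starts
    // ... fetch and process $url here ...
}
```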

I've had problems with apache using custom php error docs where the error
doc contained a php generated image that wasn't found.  Each image that
failed would generate another PHP error which cascaded until the server
basically died.

KIND OF BROADER ASIDE REGARDING SEARCH ENGINE PROBLEMS:

I've also had recursion problems because php allows any characters to be
appended after the request.  For example, let's say you have an examples.php
file and for some reason you have a relative link in examples.php to
examples/somefile.html.  If the examples directory doesn't exist, apache
will serve examples.php to the user in response to the request for
examples/somefile.html.  A recursive search engine (that isn't too smart,
i.e., infoseek and excite for colleges), will keep requesting things like:

http://example.com/examples/examples/examples/examples/examples/examples/exa
mples/examples/examples/examples/examples/examples/examples/examples/example
s/examples/examples/examples/examples/examples/examples/examples/examples/ex
amples/examples/examples/examples/examples/examples/examples/examples/exampl
es/examples/examples/examples/examples/examples/examples/examples/examples/e
xamples/examples/somefile.html

As far as apache is concerned, it is fulfilling the request with the
examples.php file, and php just sees a really long string of extra path
info (PATH_INFO rather than the query string) starting with /examples.

I'm sure that isn't your problem, but I've been bitten by it a few times.
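
If you ever want a crawler to protect itself against that kind of loop, one
possible guard is to refuse URLs where the same path segment repeats too
often.  This is just a sketch: looks_recursive() and the limit of 3 are
arbitrary choices of mine, and a legitimate site could trip it:

```php
<?php
// Hypothetical guard: true when any single path segment occurs more than
// $limit times in the URL, which suggests the crawler is chasing its tail.
function looks_recursive($url, $limit = 3)
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;                      // no usable path, nothing to check
    }
    // Split the path and drop the empty pieces left by leading/double slashes.
    $segments = array_filter(explode("/", $path), "strlen");
    if (!$segments) {
        return false;
    }
    return max(array_count_values($segments)) > $limit;
}

$bad = "http://example.com/examples/examples/examples/examples/somefile.html";
if (looks_recursive($bad)) {
    echo "Skipping: $bad\n";
}
```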

END OF ASIDE

Hope some of that ramble helps.  Please try to see if it is dying on a
particular URL so we can be of further assistance.

Sincerely,

Paul Burney
<http://paulburney.com/>

Q: Tired of creating admin interfaces to your MySQL web applications?

A: Use MySTRI instead. Version 3.1 now available.
                            <http://mystri.sourceforge.net/>


