Re: PHP Search Engine


 



On Tue, November 8, 2005 11:20 pm, Leonard Burton wrote:
> Has anyone on here created a search engine in PHP?

Sure, of sorts, now and again, here and there, to some degree.

Though it was at the lower end of search engines, possibly devolving
into web-scraping, when you get right down to it...

> I have project to create a search engine that will search about 50K or
> so pages of information on 100 or so various domain names.
>
> What have you all done in the past?

http://php.net/file_get_contents
http://php.net/mysql
http://php.net/strstr
http://php.net/preg_match
and a cron job (several, actually)
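Strung together, those pieces are enough for a crude first cut. Here is a minimal sketch (the function name is my own invention; a real engine would search a cached copy, not fetch every page live):

```php
<?php
// Crude single-page check: does this page contain the search term?
// Case-insensitive, markup stripped so we match content, not tags.
// A sketch only -- a real engine caches and indexes instead of
// fetching on every query.
function pageMatches($html, $term) {
    $text = strip_tags($html);
    return stripos($text, $term) !== false;
}

// Live use would be:
//   $html = file_get_contents('http://example.com/');
$html = '<html><body><p>PHP search engines</p></body></html>';
var_dump(pageMatches($html, 'search'));  // bool(true)
```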

How fancy does it need to be?

Are you going to attempt to search all of those in real-time?

Surely not.

You'd be slamming each domain name to the tune of:
50K pages / 100 domain names == 500 page hits per domain

So you HAVE to rule that out right from the get-go.

Which means you're going to have to cache that many pages
somehow/somewhere.  You know that, right?

Because, basically, you could write a crude search engine in a few
days with the functions linked above, assuming you are familiar with
most of them and know MySQL (or another DB) fairly well.

You'd want to queue up links to be indexed, and time/stagger them
based on domain name (actually, probably IP address of domains) so
that you don't visit any site too heavily.

I'd also recommend breaking the process up into several stages:

TASK #1:
1. Choosing a URL to index, based on IP and least-recently-visited
with a minimum time between visits.

2. Just snarf and cache the raw HTML data for that URL, and mark it
"done" with a time-stamp, so step 1 above won't do it again, and won't
hit the same IP too soon.
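The "least-recently-visited with a minimum per-IP interval" pick in step 1 can be sketched as a pure function. In practice it would be a SQL query over your queue table; all names below are hypothetical:

```php
<?php
// Sketch of TASK #1 step 1: pick the next URL to fetch, skipping any
// IP hit more recently than $minInterval seconds ago, and preferring
// the least-recently-visited URL among the eligible ones.
// The real thing would be roughly:
//   SELECT url FROM queue
//   WHERE ip_last_hit < NOW() - INTERVAL 60 SECOND
//   ORDER BY last_visited ASC LIMIT 1
function nextUrl(array $queue, array $ipLastHit, $minInterval, $now) {
    $best = null;
    foreach ($queue as $entry) {
        $lastHit = isset($ipLastHit[$entry['ip']]) ? $ipLastHit[$entry['ip']] : 0;
        if ($now - $lastHit < $minInterval) {
            continue; // this IP was hit too recently
        }
        if ($best === null || $entry['last_visited'] < $best['last_visited']) {
            $best = $entry;
        }
    }
    return $best === null ? null : $best['url'];
}
```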


TASK #2:
1. Parse one downloaded file and "index" the interesting words (data,
content, images, whatever) inside it, and store that data in a
format/schema which allows quick search/access of likely queries,
ignoring useless words/data/content (the word "the" is not worth
indexing, really)

2. Mark that downloaded/cached file "done" as far as indexing goes.
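The word-extraction half of TASK #2 might look something like this sketch (the stop-word list is a tiny illustrative sample; a real one would be much longer):

```php
<?php
// Sketch of TASK #2: reduce a cached page to indexable words with
// frequencies, dropping short words and stop words ("the" is not
// worth indexing). The result is ready to INSERT into an index table.
function indexWords($html, array $stopWords) {
    $text = strtolower(strip_tags($html));
    // split on anything that is not a letter or digit
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    foreach ($words as $w) {
        if (strlen($w) < 3 || in_array($w, $stopWords)) {
            continue;
        }
        $counts[$w] = isset($counts[$w]) ? $counts[$w] + 1 : 1;
    }
    return $counts; // word => frequency
}
```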

TASK #3:

1. Search the downloaded file for "interesting" URLs to be indexed,
and queue them up for TASK #1 to handle "later"

2. Mark the downloaded/cached file "done" as far as spidering goes.
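Step 1 of TASK #3 is where preg_match_all earns its keep. A regex over href attributes is crude but workable for a first pass; note this sketch does not resolve relative URLs or cope with malformed HTML:

```php
<?php
// Sketch of TASK #3: pull href targets out of a cached page so they
// can be queued for TASK #1. Only matches quoted href attributes.
function extractLinks($html) {
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    return array_unique($m[1]);
}
```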


TASK #4:
1. Purge downloaded/cached files (or db records or whatever) that have
been marked "done" by both TASK #2 and #3

You can then set up cron jobs with varying frequency to perform each
TASK as needed.  Possibly even with more resources devoted to TASK #1
during low-bandwidth hours (typically late-night US time, for US-based
sites) but bumping up the cron intervals for TASKS 2/3 in the daytime.
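The cron side might look something like this (script names, paths, and schedules are all purely illustrative):

```
# Hypothetical crontab: fetch heavily at night, index/spider by day.
*/5  0-6  * * *  php /usr/local/bin/task1_fetch.php    # TASK #1: late night
*/2  8-20 * * *  php /usr/local/bin/task2_index.php    # TASK #2: daytime
*/2  8-20 * * *  php /usr/local/bin/task3_spider.php   # TASK #3: daytime
0    4    * * *  php /usr/local/bin/task4_purge.php    # TASK #4: once daily
```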

None of this is Rocket Science, really, except indexing the
"interesting" content, and that is so domain-specific, we can't help
much with that, other than the general principles... MySQL fulltext
indexing would possibly take care of that for you, if you don't really
want to sweat on it too hard for now.
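If you do lean on MySQL fulltext indexing, the schema side might look roughly like this (table and column names are hypothetical; FULLTEXT indexes required MyISAM in MySQL of that vintage):

```
-- Sketch: let MySQL's FULLTEXT machinery handle indexing and ranking.
CREATE TABLE pages (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url     VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    FULLTEXT (content)
) ENGINE=MyISAM;

SELECT url, MATCH (content) AGAINST ('your query') AS score
FROM pages
WHERE MATCH (content) AGAINST ('your query')
ORDER BY score DESC
LIMIT 20;
```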

> PHPdig was a failure.

In what way[s] did it fail?

Speed/performance?

Caching?

URL equivalence identification?

Identifying embedded links?

Accessing password-protected resources?

JavaScript execution?
(Not that I think any search engine has that, but what do I know?)

Other?

I have no idea what PHPdig does or how it works, but telling us it
"failed" is not particularly useful, other than to rule it out as a
possible suggestion.

I am reasonably certain that if you Googled for:
"PHP web spider framework"
you would find several packages that would have at least 99% of what
you need...  Because these simply have to exist out there.

-- 
Like Music?
http://l-i-e.com/artists.htm

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

