On Tue, November 8, 2005 11:20 pm, Leonard Burton wrote:
> Has anyone on here created a search engine in PHP?

Sure, of sorts, now and again, here and there, to some degree.

Though it was at the lower end of search engine, possibly devolving to
web-scraping, when you get right down to it...

> I have project to create a search engine that will search about 50K or
> so pages of information on 100 or so various domain names.
>
> What have you all done in the past?

http://php.net/file_get_contents
http://php.net/mysql
http://php.net/strstr
http://php.net/preg_match
and a cron job (several, actually)

How fancy does it need to be?

Are you going to attempt to search all of those in real-time?

Surely not.

You'd be slamming each domain name to the tune of:
50K pages / 100 domain names == 500 page hits per domain, for every search

So you HAVE to rule that out right from the get-go.

Which means you're going to have to cache that many pages
somehow/somewhere.

You know that, right?

Basically, you could write a crude, simplistic search engine in a few
days with the functions linked above, assuming you are familiar with
most of them, and know MySQL (or other DB) fairly well.

You'd want to queue up links to be indexed, and time/stagger them based
on domain name (actually, probably IP address of domains) so that you
don't visit any site too heavily.

I'd also recommend breaking the process up into several stages:

TASK #1:
1. Choose a URL to index, based on IP and least-recently-visited, with a
   minimum time between visits.
2. Just snarf and cache the raw HTML data for that URL, and mark it
   "done" with a time-stamp, so step 1 above won't do it again, and
   won't hit the same IP too soon.

TASK #2:
1. Parse one downloaded file and "index" the interesting words (data,
   content, images, whatever) inside it, and store that data in a
   format/schema which allows quick search/access for likely queries,
   ignoring useless words/data/content (the word "the" is not worth
   indexing, really).
2. Mark that downloaded/cached file "done" as far as indexing goes.

TASK #3:
1. Search the downloaded file for "interesting" URLs to be indexed, and
   queue them up for TASK #1 to handle "later".
2. Mark the downloaded/cached file "done" as far as spidering goes.

TASK #4:
1. Purge downloaded/cached files (or db records or whatever) that have
   been marked "done" by both TASK #2 and #3.

You can then set up cron jobs with varying frequency to perform each
TASK as needed. Possibly even with more resources devoted to TASK #1
during low-bandwidth hours (typically late-night US time, for US-based
sites) but running TASKS 2/3 more often in the daytime.

None of this is Rocket Science, really, except indexing the
"interesting" content, and that is so domain-specific, we can't help
much with that, other than the general principles...

MySQL fulltext indexing would possibly take care of that for you, if you
don't really want to sweat on it too hard for now.

> PHPdig was a failure.

In what way[s] did it fail?

Speed performance?
Caching?
URL equivalence identification?
Identifying embedded links?
Accessing password-protected resources?
JavaScript execution? (Not that I think any search engine has that, but
what do I know?)
Other?

I have no idea what PHPdig does or how it works, but telling us it
"failed" is not particularly useful, other than to rule it out as a
possible suggestion.

I am reasonably certain that if you Googled for:
"PHP web spider framework"
you would find several packages that would have at least 99% of what
you need... Because these simply have to exist out there.

--
Like Music?
http://l-i-e.com/artists.htm

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
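
P.S. For what it's worth, step 1 of TASK #1 (pick a URL by IP and
least-recently-visited, with a minimum gap between visits) might look
roughly like the sketch below. It is in Python rather than PHP, purely
for illustration, and it keeps the queue in memory where a real setup
would presumably use a MySQL table. The names QueuedUrl, pick_next_url,
mark_fetched and the 60-second gap are all made up for the example, not
anything from an existing package.

```python
from dataclasses import dataclass

MIN_DELAY_SECONDS = 60.0  # assumed politeness gap per IP; tune to taste


@dataclass
class QueuedUrl:
    url: str
    ip: str                    # resolved IP of the host (looked up elsewhere)
    last_fetched: float = 0.0  # Unix time of the last fetch; 0.0 means never


def pick_next_url(queue, last_hit_by_ip, now, min_delay=MIN_DELAY_SECONDS):
    """Return the least-recently-fetched URL whose IP has rested long enough.

    Returns None when every IP in the queue was hit less than
    min_delay seconds ago.
    """
    eligible = [
        item for item in queue
        if now - last_hit_by_ip.get(item.ip, float("-inf")) >= min_delay
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda item: item.last_fetched)


def mark_fetched(item, last_hit_by_ip, now):
    """Time-stamp both the URL and its IP so neither is revisited too soon."""
    item.last_fetched = now
    last_hit_by_ip[item.ip] = now
```

A cron job would call pick_next_url() once per run, fetch and cache the
page if it got something back, then call mark_fetched(). Keying the
delay on IP rather than host name means two domains on the same box
share one budget.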
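
P.P.S. TASK #2 and TASK #3 both work on the same cached file, so here
is a crude sketch of the pair: count the "interesting" words while
skipping useless ones, and collect the links to queue for TASK #1.
Again Python for illustration only, standard library only. The
stop-word list, the PageParser class and index_and_spider() are
assumptions of mine, and "interesting" here just means "not a stop word
and longer than one character", which is surely too naive for real
content.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Assumed stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}


class PageParser(HTMLParser):
    """Collects visible text and absolute http(s) links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def index_and_spider(html, base_url):
    """Return (word counts, links found) for one cached page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return counts, parser.links
```

The word counts would go into whatever index table the search queries
read, and the links would go back into the TASK #1 queue, after
checking they are not already in it.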