Re: Subject: Searching remote web sites for content

"Neil Smith [MVP, Digital media]" <php@xxxxxxxxxxxxxxxxxxxxxxxx> · Sun, 23 Oct 2005 12:52:14 +0100

At 06:26 23/10/2005, you wrote:
Date: Sat, 22 Oct 2005 13:21:26 -0400
From: Joseph Crawford <codebowl@xxxxxxxxx>
To: " Mailing List" <php-db@xxxxxxxxxxxxx>
Subject: Re:  Re: Subject: Searching remote web sites for content

> why do all that,

Oh, it's far less work than the method you're proposing: you only
have one site to fopen(), not many dozens. There's no 'all that' to
it. It's the same method we're discussing, but more efficient (see
point 3).

> if you know the address of the page that the link will
> reside on just curl that page for the results and preg_match that.

Ref the OP : "I ask them to nominate where the link back page is, and 
I could check this manually.  But is there a way to check whether the 
remote page links back using a php script, so that I could get a 
report and follow up on exceptions, without having to check all pages 
that say they link to my site?"

Three reasons. 1 is that the nomination process might be poorly
understood by the nominee, or they could be inept and place the link
somewhere other than where they specified (or move it about once
nominated). You'd need to be able to crawl their entire site in order
to automate the scan on a regular basis, or you're back to "and I
could check this manually".

2 is that unless you want to write a very robust parser, you may
as well rely on Google's hard work writing such a parser. You can't
be sure *how* the referring webmaster has set up his links (re: inept)
so they could occur in a wide range of formats. The results from
Google come in a regular format, so they're easy to parse - and you
said yourself you're not too certain of the regex you'd need - why
complicate it by having to cover dozens of eventualities?

3 is that the point of the exercise is to ensure good SE rankings by
having referring links of high relevance. Only Google knows how that
relevance ranking results in a search index placement based on link
popularity - and that includes how it treats hidden links used to
'spam' the search engine, which you don't want.
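
To illustrate reason 2, here is a minimal sketch of why a single
preg_match pattern is fragile. The helper names (hasLinkNaive,
hasLinkLoose), the host example.com and the sample markup are all
made up for illustration; they are not from the original thread.

```php
<?php
// Hypothetical helper: a naive check that assumes one exact link format
// (lowercase tag, href as the first attribute, double quotes, no "www.").
function hasLinkNaive(string $html, string $host): bool
{
    $pattern = '/<a href="http:\/\/' . preg_quote($host, '/') . '/';
    return preg_match($pattern, $html) === 1;
}

// Hypothetical helper: tolerates upper/lower case, other attributes before
// href, single/double/no quotes, http or https, and an optional "www.".
function hasLinkLoose(string $html, string $host): bool
{
    $pattern = '/<a\b[^>]*\bhref\s*=\s*["\']?\s*https?:\/\/(?:www\.)?'
             . preg_quote($host, '/') . '\b/i';
    return preg_match($pattern, $html) === 1;
}

// Made-up examples of how a referring webmaster might write the same link.
$samples = [
    '<a href="http://example.com/">Example</a>',
    "<A HREF='http://www.example.com'>Example</A>",
    '<a class="x" href=http://example.com>Example</a>',
];

foreach ($samples as $html) {
    printf(
        "%s\n  naive: %s, loose: %s\n",
        $html,
        hasLinkNaive($html, 'example.com') ? 'yes' : 'no',
        hasLinkLoose($html, 'example.com') ? 'yes' : 'no'
    );
}
```

The naive pattern only matches the first sample; the looser one
matches all three. Even so, it is still not a robust parser: it would
also accept a longer host that merely starts with example.com, and it
knows nothing about redirects, scripted links or hidden links.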

So, relying on Google to spider the remote site is a way to ensure
your QA process for the link referrals really does result in a usable
link:mysite index in the search engine - which of course is *the
whole point of the exercise*!
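
The "report and follow up on exceptions" step the OP asked for could
be sketched roughly as below. It assumes you already have the list of
URLs returned by a link:mysite query in an array; how you obtain them
is deliberately not shown, and no result format is assumed. The
function names and the partner hosts are hypothetical.

```php
<?php
// Hypothetical helper: reduce a URL or bare host name to a comparable host
// (lowercase, without a leading "www.").
function normaliseHost(string $urlOrHost): string
{
    $host = parse_url($urlOrHost, PHP_URL_HOST);
    if (!is_string($host)) {
        // No scheme/host part found, so treat the input as a bare host name.
        $host = $urlOrHost;
    }
    return preg_replace('/^www\./', '', strtolower($host));
}

// Hypothetical helper: return the partner hosts that do not appear anywhere
// in the list of result URLs, i.e. the exceptions to follow up on.
function findMissingLinks(array $partnerHosts, array $resultUrls): array
{
    $seen = [];
    foreach ($resultUrls as $url) {
        $seen[normaliseHost($url)] = true;
    }

    $missing = [];
    foreach ($partnerHosts as $host) {
        if (!isset($seen[normaliseHost($host)])) {
            $missing[] = $host;
        }
    }
    return $missing;
}

// Made-up data: sites that say they link back, and URLs found by the query.
$partners = ['partner-a.example', 'www.partner-b.example', 'partner-c.example'];
$results  = [
    'http://www.partner-a.example/links.html',
    'http://partner-b.example/friends/',
];

foreach (findMissingLinks($partners, $results) as $host) {
    echo "No link found from: $host\n";
}
```

With this made-up data only partner-c.example is reported. Matching
on the host rather than on the nominated page means a link that has
been moved elsewhere on the partner's site still counts (reason 1).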

HTH
Cheers - Neil  

--
PHP Database Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php