[Hotplug_sig] Re: PLM patch spider

On Thu, Jun 23, 2005 at 02:13:48PM -0700, Judith Lebzelter wrote:
> 
> 
> On Thu, 23 Jun 2005, Dave Hansen wrote:
> 
> > On Mon, 2005-05-02 at 13:41 -0700, Judith Lebzelter wrote:
> > > We have a cron job that will hit the patch directory once every three 
> > > hours to check for new patches.  It also pulls a patch if it finds one.  
> > > This is the same schedule we use for other kernel patches.
> > 
> > Is there a chance that you guys could update your PLM fetcher a little
> > bit?  It likes to go looking for files that aren't actually present on
> > my web server,

Which files are those?  I can probably filter those more tightly in
package_retriever.
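
To make that concrete: assuming package_retriever scrapes links out of a
directory index, a tighter filter might look something like the sketch
below.  The filename pattern is a guess on my part; I'd adjust it to
whatever the patch names actually look like on your server.

  # hypothetical: pull hrefs out of the index page and keep only names
  # that look like patches, so we never request anything else
  curl --silent "$INDEX_URL" \
    | grep -Eo 'href="[^"]+"' \
    | sed -e 's/^href="//' -e 's/"$//' \
    | grep -E '\.patch(\.gz|\.bz2)?$'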

> > which generates a fair amount of log output that I
> > usually like to keep an eye on.  That isn't a big problem in and of
> > itself, it just bloats the logs.

Yeah, spidering is kind of a blunt instrument in general...  I notice
that new patches aren't posted very often.  Is it worth scanning for
new patches as frequently as we do?
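
(For reference, the three-hour schedule Judith described would be
something like this in a crontab; the script path is made up:)

  # hypothetical crontab entry: poll the patch directory every three hours
  0 */3 * * *  /usr/local/bin/package_retriever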

> > The user agent strings are now something like: "PLM/0.1" and
> > "lwp-trivial/1.41".  Could they, perhaps, get a little bit more
> > informative, like:
> > 
> > 	"PLM/0.1 patch spider http://developer.osdl.org/dev/stp/"
> > 
> 
> Good idea to give more info.  We had a problem with a busy robot last
> week, and I was very glad it gave us enough info to identify it.

I've updated the script to use curl instead of LWP, and it now sends an
agent string like this:

  package_retriever/1.00 <hostname> spider <descriptive-comment>
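
In curl terms that works out to roughly the sketch below.  The variable
names are mine, and I've used the URL Dave suggested as the descriptive
comment; the real script may differ.

  # hedged sketch: send the new agent string via curl's -A/--user-agent flag
  AGENT="package_retriever/1.00 $(hostname) spider http://developer.osdl.org/dev/stp/"
  curl --silent --user-agent "$AGENT" --remote-name "$url"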

I'm going to make a few more changes to the script, so it may be a few
days before I can switch over to it.

Also, are you sure it's reporting lwp-trivial?  I was actually using
lwp-simple.

> > Also, it would be kind if they obeyed robots.txt, or at least fetched
> > it.  My log analyzer will detect robots based just on fetching
> > "robots.txt" when beginning a crawl.
> 
> We should be obeying robots.txt.
> 
> We will be able to do these updates, but it may take a little while to
> get to them and deploy them.

I've implemented robots.txt support in package_retriever.  It doesn't
look like the script ever descends into the directory your robots.txt
blocks, but this should give you a way to control what it spiders.
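
Roughly what I mean by robot support, sketched in shell (the real
package_retriever logic may differ, and the host here is a stand-in).
This simplified version ignores User-agent sections and applies every
Disallow rule to us:

  # fetch robots.txt up front, then skip any URL whose path falls under
  # a Disallow rule before fetching it
  BASE="http://example.org"          # stand-in for the server being spidered
  curl --silent "$BASE/robots.txt" -o robots.txt
  rules=$(awk -F': *' 'tolower($1) == "disallow" && $2 != "" { print $2 }' robots.txt)
  for url in $urls; do
    path="/${url#"$BASE"/}"
    for rule in $rules; do
      case "$path" in
        "$rule"*) continue 2 ;;      # blocked by robots.txt: skip this URL
      esac
    done
    curl --silent --remote-name "$url"
  done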

Bryce



