On Thu, Jun 23, 2005 at 02:13:48PM -0700, Judith Lebzelter wrote:
> On Thu, 23 Jun 2005, Dave Hansen wrote:
> > On Mon, 2005-05-02 at 13:41 -0700, Judith Lebzelter wrote:
> > > We have a cron job that will hit the patch directory once every three
> > > hours to check for new patches. It also pulls a patch if it finds one.
> > > This is the same schedule we use for other kernel patches.
> >
> > Is there a chance that you guys could update your PLM fetcher a little
> > bit? It likes to go looking for files that aren't actually present on
> > my web server,

Which files are those? I can probably filter those more tightly in
package_retriever.

> > which generates a fair amount of log output that I
> > usually like to keep an eye on. That isn't a big problem in and of
> > itself, it just bloats the logs.

Yeah, spidering is kind of a blunt instrument in general... I notice
that new patches aren't posted too often. Is it worth it for us to be
scanning for new patches so often?

> > The user agent strings are now something like: "PLM/0.1" and
> > "lwp-trivial/1.41". Could they, perhaps, get a little bit more
> > informative, like:
> >
> >    "PLM/0.1 patch spider http://developer.osdl.org/dev/stp/"
>
> Good idea to give more info. We had a problem with a busy robot last week
> and I was very happy they gave us enough info.

I've updated the script so that it fetches with curl instead of lwp and
sends an agent string like this:

   package_retriever/1.00 <hostname> spider <descriptive-comment>

I'm going to make a few more changes to this script, so it may be a few
days before I am able to switch over to it.

Also, are you sure it's reporting lwp-trivial? I was actually using
lwp-simple.

> > Also, it would be kind if they obeyed robots.txt, or at least fetched
> > it. My log analyzer will detect robots based just on fetching
> > "robots.txt" when beginning a crawl.
>
> We should be obeying robots.txt.
>
> We will be able to do these updates, but it may take a little while to
> get to them and deploy.

I've implemented robots.txt support in package_retriever. It doesn't
look like the script ever descends into the directory your robots.txt
is blocking, but this should give you the ability to control what it
spiders.

Bryce
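
For illustration only, producing an agent string in the format Bryce
describes looks roughly like this with stock LWP. This is a sketch, not
the actual package_retriever code: the version number, hostname lookup,
and fetched URL are placeholders. With curl, the -A option sets the same
header.

  #!/usr/bin/perl
  # Sketch: send an agent string of the form
  #   package_retriever/1.00 <hostname> spider <descriptive-comment>
  use strict;
  use warnings;
  use LWP::UserAgent;
  use Sys::Hostname;

  my $agent = sprintf("package_retriever/1.00 %s spider http://developer.osdl.org/dev/stp/",
                      hostname());

  my $ua = LWP::UserAgent->new;
  $ua->agent($agent);   # curl equivalent: curl -A "$agent" <url>

  my $response = $ua->get('http://www.example.org/patches/');
  print $response->status_line, "\n";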
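
A similar sketch for honoring robots.txt, using the stock LWP::RobotUA
class rather than whatever package_retriever actually does; the contact
address and URLs are placeholders. Because LWP::RobotUA requests
/robots.txt from a host before anything else, it would also trip the
robot-detection heuristic Dave describes.

  #!/usr/bin/perl
  # Sketch: fetch politely while obeying robots.txt.  LWP::RobotUA
  # downloads and caches robots.txt for each new host and refuses
  # requests that the rules disallow.
  use strict;
  use warnings;
  use LWP::RobotUA;

  my $ua = LWP::RobotUA->new(
      agent => 'package_retriever/1.00',
      from  => 'someone@example.org',   # contact address (placeholder)
  );
  $ua->delay(1/60);   # wait no more than a second between requests

  # A URL disallowed by robots.txt comes back as
  # "403 Forbidden by robots.txt" without hitting the server.
  my $response = $ua->get('http://www.example.org/patches/patch-2.6.12.gz');
  print $response->status_line, "\n";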