On Tue, 5 Oct 2010 16:16:06 +0300, Isaac Witmer <isaaclw@xxxxxxxxx> wrote:
> How would you do it?
> with wget, the only way of having it crawl through websites, is to
> recurse... isn't it?

For the wget command line, yes. But that is not the only piece involved.

You need to make two access_log entries for Squid: one that pipes only
the client requests to the wget script, and one that records only the
wget requests. See http://www.squid-cache.org/Doc/config/access_log for
details on log ACLs. (A rough sketch of what those two lines could look
like is at the bottom of this mail, below the quoted thread.)

This method gives you two very important effects:
 1) requests made by wget are not fed back into the wget script, so
    pages are not fetched twice or more.
 2) you have separate logs to compare the bandwidth consumption of wget
    vs non-wget traffic.

What I think you will find is that pre-caching wastes more bandwidth
overall than not pre-caching, possibly to the point of slowing access
for the real requests. It is also useless on Web 2.0 websites with AJAX
etc.

NP: you may want to write a logging daemon to do all this instead of
tailing the access log. That will give you the log data in real time
without any problems during rotation/reconfigure/restart. (A minimal
helper sketch is also included at the bottom of this mail.)

Amos

>
> I tried screwing around, and the best I came up with was this:
>
>> #!/bin/bash
>> log="/var/log/squid3/access.log"
>>
>> while (true); do
>>   echo "reading started: `date`, log file: $log"
>>   sudo tail -n 80 $log | grep -P "/200 [0-9]+ GET" | grep "text/html" |
>>     awk '{print $7}' | wget -q -rp -nd -l 1 --delete-after -i -
>>   sleep 5
>>   echo
>> done
>
> It's not so clean...
>
> On Tue, Oct 5, 2010 at 11:51 AM, John Doe <jdmls@xxxxxxxxx> wrote:
>>
>> From: flaviane athayde <flavianeathayde@xxxxxxxxx>
>>
>> > I try to put a shell script that read the Squid log, and use it to
>> > run wget with "-r -l1 -p" flag, but it also get its on pages, making
>> > a infinit loop, and I can't resolve it.
>>
>> Why recurse?
>> If you take your list from the log files, you will get all accessed
>> files already... no?
>>
>> JD
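
For illustration only, here is roughly what the two access_log lines
could look like in squid.conf, assuming the wget traffic can be
recognised by its User-Agent header. The ACL name, the "Wget" regex and
the file paths are placeholders, not tested config; check the
access_log documentation linked above for the exact syntax in your
Squid version.

  # hypothetical ACL matching the pre-fetch script's own requests
  acl wget_requests browser Wget

  # real client traffic only -- this is the file the wget script reads
  access_log /var/log/squid/access-clients.log squid !wget_requests

  # the pre-fetch traffic, kept separate for the bandwidth comparison
  access_log /var/log/squid/access-wget.log squid wget_requests

If the User-Agent is not reliable (or wget is told to fake one), an
"acl ... src" entry matching the machine running the script does the
same job.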
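
And a minimal shape for the logging-daemon idea, assuming a Squid
version that supports the daemon: log module together with a
logfile_daemon entry pointing at a helper of your own. As far as I
recall, Squid writes one command per line on the helper's stdin and
the lines beginning with "L" carry the actual log record; the rest
(rotate, flush, etc.) can be ignored here. Untested sketch, with the
URL taken from field 7 of the default squid logformat:

  #!/bin/bash
  # hypothetical logfile_daemon helper: read log records from Squid on
  # stdin and pre-fetch each URL plus its page requisites (no
  # recursion, so the helper never feeds its own fetches back to
  # itself).
  while IFS= read -r line; do
      case "$line" in
          L*)
              url=$(awk '{print $7}' <<< "${line#L}")
              wget -q -p -nd --delete-after "$url" >/dev/null 2>&1 &
              ;;
      esac
  done

In real use you would still want the /200 ... GET and text/html
filters from the script above, and some cap on how many wget
processes run at once.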