On 22Sep2018 19:10, bruce <badouglas@xxxxxxxxx> wrote:
My questions would probably be how to speed up something, or how to
possibly redo/re-architect part of the crawl process.
Well, it is technically not Fedora specific, but this place seems the most
active shell-related list I'm on. So "how do I improve this web crawling shell
script?" might be ok. Disclaimer: not a list admin.
As an example, I have a situation where I use cheap cloud vms
(digitalocean) to perform the fetches. The fetches are basic "curl"
with the required attributes. The curl also includes a "Cookie" for
the curl/fetch for the target server. When running from the given ip
address of the vm the target "blocks" the fetch. (I guess someone else
could have tried to fetch a bunch earlier -- who knows). So, I use an
anonymous proxy-server ip to then generate the fetch. This process
works, but it's slow. So, the process runs a number of these in
parallel on the cheap droplet. While this speeds things up, it's
still "slow"... I've also tested running curl with multiple HTTP URLs
in the same curl invocation.
Multiple URLs given to a single wget or curl invocation are fetched in series.
(Newer curl, 7.66 and later, does have a -Z/--parallel option.)
You can get quite aggressive with parallelism in the shell, but it isn't the
best tool for fine grained control of lots of subprocesses, because you can't
trivially "wait for a _single_ one of my subprocesses to complete". So the
obvious "read URLs and dispatch background curls up to some limit, then wait
for one to complete before kicking off the next" isn't so easy. (You can wait
for a specific pid, but that's no help when you don't know which fetch will
complete first.)
You can do things like firing each one off in its own subshell which writes a
line to a file or pipe on completion, then track completions by monitoring
that log.
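A minimal sketch of that completion-log idea, with "sleep" standing in for
the real curl fetch and an invented URL list:

```shell
# Each "fetch" runs in its own subshell and appends one line to a
# shared log when it finishes. "sleep" stands in for the real curl
# command, and the URLs are invented for illustration.
logfile=$(mktemp)
for url in /a /b /c /d; do
    (
        sleep 0.1                       # the real fetch would go here:
                                        #   curl ... "$url"
        echo "done $url" >>"$logfile"   # record completion in the log
    ) &
done
wait                                    # a dispatcher could instead poll
                                        # $logfile and start a new fetch
                                        # as each line appears
completed=$(( $(wc -l <"$logfile") ))   # one log line per finished fetch
rm -f "$logfile"
```

Appending short lines with >> is safe here because each subshell opens the
file in append mode and writes a single small line.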
A bursty alternative looks like this:
    ( # subshell to ensure no unexpected children
      maxbg=128  # pick a number
      nbg=0
      while read -r url <&3
      do
        curl ... "$url" &
        nbg=$(( nbg + 1 ))
        [ $nbg -lt $maxbg ] || { wait; nbg=0; }
      done 3<urls.txt
      wait
    )
which fires off bursts of 128 curls, waits for them all to complete, then
runs another burst, and so on.
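With bash 4.3 or later, "wait -n" removes the burstiness: it returns as soon
as any single background job exits, so you can keep the pool of fetches
topped up. A sketch under that assumption, again with "sleep" standing in
for curl and an inline invented URL list:

```shell
#!/bin/bash
# bash 4.3+ sketch: keep up to $maxbg jobs running continuously.
# "sleep" stands in for the real curl fetch; the URLs are invented.
log=$(mktemp)
maxbg=3
nbg=0
while read -r url; do
    ( sleep 0.1; echo "$url" >>"$log" ) &   # real code: curl ... "$url" &
    nbg=$(( nbg + 1 ))
    if [ "$nbg" -ge "$maxbg" ]; then
        wait -n                             # block until any one job exits
        nbg=$(( nbg - 1 ))
    fi
done <<'EOF'
http://example.com/1
http://example.com/2
http://example.com/3
http://example.com/4
http://example.com/5
EOF
wait                                        # drain the remaining jobs
fetched=$(( $(wc -l <"$log") ))             # one log line per fetch
rm -f "$log"
```

Unlike the burst version, a new fetch starts the moment any one finishes,
so the pool stays full.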
If you want finer control, maybe move to Python, using the requests library
to do the fetches and threads for the parallelism. But then you need to learn
Python (highly recommended anyway, but a further hurdle for your initial
task).
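If you'd rather stay in the shell, xargs with -P (supported by both GNU and
BSD xargs) is another way to bound parallelism: it keeps up to N commands
running and starts a fresh one as each finishes. A sketch where "echo"
stands in for the real curl command and the URL list is invented:

```shell
# xargs -P sketch: up to 4 concurrent "fetches", with a new one
# starting whenever one finishes. "echo fetched" stands in for the
# real curl command, and the URLs are invented for illustration.
urls=$(mktemp)
out=$(mktemp)
printf '%s\n' /a /b /c /d /e /f >"$urls"
xargs -P 4 -n 1 echo fetched <"$urls" >"$out"   # one command per URL
fetched=$(( $(wc -l <"$out") ))                 # one output line per URL
rm -f "$urls" "$out"
```

For real fetches you'd replace "echo fetched" with the curl command and its
options, and watch out for URLs containing quotes or whitespace.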
Cheers,
Cameron Simpson <cs@xxxxxxxxxx>
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx