Re: offtopic questions??

On 22Sep2018 19:10, bruce <badouglas@xxxxxxxxx> wrote:
> My questions would probably be how to speed up something, or how to
> possibly redo/re-architect part of the crawl process.

Well, it's technically not Fedora specific, but this place seems to be the most active shell-related list I'm on. So "how do I improve this web crawling shell script?" might be ok. Disclaimer: not a list admin.

> As an example, I have a situation where I use cheap cloud VMs
> (DigitalOcean) to perform the fetches. The fetches are basic "curl"
> with the required attributes. The curl also includes a "Cookie" for
> the curl/fetch for the target server. When running from the given IP
> address of the VM, the target "blocks" the fetch. (I guess someone else
> could have tried to fetch a bunch earlier -- who knows.) So, I use an
> anonymous proxy-server IP to then generate the fetch. This process
> works, but it's slow. So, the process runs a number of these in
> parallel at the same time on the cheap droplet. While this speeds
> things up, it's still "slow"... I've also tested running curl with
> multiple "http" URLs in the same curl.

Multiple URLs given to a single wget or curl invocation are fetched in series, so that won't get you any parallelism.

You can get quite aggressive with parallelism in the shell, but it isn't the best tool for fine grained control of lots of subprocesses, because you can't trivially "wait for a _single_ one of my subprocesses to complete". So the obvious "read URLs and dispatch background curls up to some limit, then wait for one to complete before kicking off the next" isn't so easy. (You can wait for a specific pid, but that's no help when you don't know which fetch will complete first.)
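That said, if your bash is 4.3 or later it has "wait -n", which does wait for any single background job to finish. A minimal sketch of the dispatch loop assuming that, with maxbg and urls.txt as placeholder names:

    maxbg=16 nbg=0        # pick a pool size
    while read -r url
    do
      curl ... "$url" &   # your usual curl options in place of ...
      nbg=$(( nbg + 1 ))
      # pool full: wait for any one fetch to finish before dispatching more
      [ "$nbg" -lt "$maxbg" ] || { wait -n; nbg=$(( nbg - 1 )); }
    done <urls.txt
    wait    # reap the stragglers

This keeps a steady maxbg fetches in flight instead of letting the pool drain.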

Without that, you can do things like firing each fetch off in its own subshell which writes a line to a file or pipe on completion, then track completions by monitoring that log.
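A minimal sketch of that idea using a named pipe (done.fifo, urls.txt and the pool size are placeholder names, and "curl ..." again stands for your usual options): each fetch writes a line to the pipe when it finishes, and the dispatcher reads one line to free up a slot:

    mkfifo done.fifo
    exec 4<>done.fifo    # open read/write so neither end blocks on open
    maxbg=16 nbg=0
    while read -r url
    do
      ( curl ... "$url"; echo "$url" >&4 ) &   # report completion on the pipe
      nbg=$(( nbg + 1 ))
      # all slots busy: consume one completion report before dispatching more
      [ "$nbg" -lt "$maxbg" ] || { read -r _ <&4; nbg=$(( nbg - 1 )); }
    done <urls.txt
    wait
    rm done.fifo

This should work in a plain POSIX shell, and the completion lines double as a progress log.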

A bursty alternative looks like this:

 ( # subshell to ensure no unexpected children
   maxbg=128   # pick a number
   nbg=0
   while read -r url <&3
   do
     curl ... "$url" &    # your usual curl options in place of ...
     nbg=$(( nbg + 1 ))
     # burst full: wait for every fetch in it, then start a fresh burst
     [ "$nbg" -lt "$maxbg" ] || { wait; nbg=0; }
   done 3<urls.txt
   wait    # reap the final partial burst
 )

which fires off bursts of 128 curls, waits for them all, then runs another burst, and so on. The downside is that each burst runs as long as its slowest fetch, with the pool draining to nothing before the next burst starts.

If you want finer control, maybe move to Python, using the requests library for the fetches and threads for the parallelism. But then you need to learn Python (highly recommended anyway, but a further hurdle for your initial task).
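A minimal sketch of that shape, assuming the third party "requests" package (pip install requests) and the same urls.txt as above; the Cookie and proxy settings from your curl setup would be passed to requests.get:

    #!/usr/bin/env python3
    # minimal sketch: a pool of worker threads doing the fetches
    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch(url):
        # the cookies/proxies from the curl setup would go here,
        # e.g. requests.get(url, cookies=..., proxies=...)
        response = requests.get(url, timeout=30)
        return url, response.status_code

    def main():
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        # 16 worker threads: the pool keeps 16 fetches in flight at once
        with ThreadPoolExecutor(max_workers=16) as pool:
            for url, status in pool.map(fetch, urls):
                print(status, url)

    main()

From there you can retry failures, rotate proxies per request and so forth, all of which gets clumsy in the shell.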

Cheers,
Cameron Simpson <cs@xxxxxxxxxx>


