Re: off topic: combined output of concurrent processes

On 15Apr2012 05:52, Amadeus W.M. <amadeus84@xxxxxxxxxxx> wrote:
| With this exact script, it works for FOO (probably because it's short). 
| For FOOOOOOOOOO...(1000 Os) I see again fewer than 100 lines in "zot". 
| This, if I iterate 100 times. If I iterate, say, 10-20 times only, I seem 
| to get all the lines. Can it have something to do with the number of jobs 
| executed in the background?
| 
| The real code is like this:
| 
| #!/bin/bash
| for url in $(cat myURLs)
| do
| 	curl -s $url &
| done

[ Potential solution at bottom of post. ]

OK, another question: is curl using ONLY the -s (silent) option? And
specifically, is it NOT using -C (continue)?

Curl can behave differently if its output is not a terminal. Aside from
turning off the progress meter if the output _is_ a terminal, with -C
(in its "-C -" form) it uses the output file to figure out how much data
to request; if it thinks the file is already partially fetched it won't
fetch the front part.

Finally, it is conceivable that curl might seek() the file.

The reason I suggest this is that you said using >> made it good.
Normally (with a single output file, opened just the once) > and >>
would behave the same. But suppose curl is, internally, seek()ing to a
particular position. With a file opened for append that does nothing
(the next write will go at the end of the file anyway), but without
append the seek would reposition the file pointer and overwrites would
occur.

Curl's got no good reason to do that (even with a -C option), but it
might; if we suspect this, some tests using the strace command can tell
us.
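
If we do want to check the seek() theory, something like the following
would be a starting point. It is a sketch only, untested against your
setup; the function name trace_seeks and the file names curl.trace and
curl.out are made up for illustration. It runs one curl under strace,
logging only lseek and write calls, then looks for seeks on file
descriptor 1 (curl's standard output):

```shell
#!/bin/sh
# Sketch: trace one curl and report any seeks on its standard output.
# trace_seeks, curl.trace and curl.out are hypothetical names.
trace_seeks() {
  url=$1
  # -f follows child processes and threads, -o writes the trace to a
  # file, -e trace=... limits the log to the two calls we care about.
  strace -f -o curl.trace -e trace=lseek,write curl -s "$url" >curl.out
  # An lseek on fd 1 would support the theory; no match suggests curl
  # is not repositioning its output.
  grep 'lseek(1,' curl.trace
}
```

Call it as: trace_seeks "$url" with one of the URLs from myURLs. On a
32-bit system the call may show up as _llseek instead, so the trace
filter and the grep pattern would need adjusting there.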

However, we can work around this whole issue and solve two problems:
  - the sharing of the output file, which we _suspect_ may be triggering
    bad behaviour from curl
  - the possible interleaving of curl outputs: curl _will_ get data from
    the URL in chunks, and parallel curls will interleave their output
    chunks

Look at this (completely untested) loop:

  # a little setup
  cmd=`basename "$0"`
  : ${TMPDIR:=/tmp}
  tmppfx=$TMPDIR/$cmd.$$

  i=0
  while read -r url
  do
    i=$((i+1))
    out=$tmppfx.$i
    if curl -s "$url" >"$out"
    then  echo "$out"
    else  echo "$cmd: curl fails on: $url" >&2
          rm -f "$out"    # don't leave the failed fetch's temp file behind
    fi &
  done < myURLs \
  | while read -r out
    do
      cat "$out"
      rm "$out"
    done \
  | tee all-data.out \
  | your-data-parsing-program

This program does a few things:
  - gives each curl its own output file, avoiding our issues
  - runs them all in parallel, achieving your aim
  - never interleaves one curl with another;
    the second loop reads each completed output file in turn, completely
  - takes a copy using tee to the file all-data.out, just so you can
    inspect it
  - uses the "run until EOF" approach, avoiding tricky games with "wait"
  - writes output filenames using echo to the pipe in parallel;
    the echoes _should_ all do single writes into the pipe to the second
    loop, and never interfere with each other in consequence
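
The "run until EOF" point can be seen in a small stand-alone sketch,
with sleep standing in for curl (eof_demo is a made-up name, not part of
the script above). The reading loop only sees EOF once every background
job has exited and so closed its copy of the pipe, which is why no
"wait" is needed:

```shell
#!/bin/sh
# Sketch: background writers keep the pipe open until they finish.
eof_demo() {
  for n in 3 1 2
  do
    # each background job inherits the write end of the pipe
    { sleep "$n"; echo "job $n done"; } &
  done \
  | while read -r line
    do
      echo "got: $line"
    done
  # we only get here after the slowest job has exited
  echo "reader saw EOF"
}
```

Running eof_demo should report the jobs in the order they finish, not
the order they were started, and print the last line only after the
slowest job is done.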

Shortcomings: if you have too many URLs you will run out of processes (or
available connections at the target web server) by running too many
curls at once, because the loop will read URLs and fork/exec curls as
fast as it can. If this solves your problems in a fashion pleasing to
your mind, we can move to a more advanced token-based loop to keep a
maximum number of curls in play at any one time.
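
For the curious, one way such a token scheme might look is sketched
below. This is an illustration under assumptions, not a finished
design: the name run_limited, the use of file descriptor 3 and the
temporary FIFO name are all made up here, and opening a FIFO read-write
is something Linux permits but POSIX leaves undefined. The idea is to
preload a pipe with N tokens; each job takes a token before it starts
and puts one back when it finishes, so at most N jobs run at once.

```shell
#!/bin/sh
# Sketch: run_limited MAX CMD [ARGS...]
# Reads one item per line from stdin and runs "CMD ARGS... item" in the
# background, never more than MAX at a time.  The body is a ( subshell )
# so that fd 3 and the variables stay private to the function.
run_limited() (
  max=$1; shift
  fifo=$(mktemp -u "${TMPDIR:-/tmp}/tokens.XXXXXX") || exit 1
  mkfifo "$fifo" || exit 1
  exec 3<>"$fifo"     # read-write open, so the open itself does not block
  rm -f "$fifo"       # the open descriptor keeps the pipe alive
  n=0
  while [ "$n" -lt "$max" ]     # seed the pipe with MAX tokens
  do
    echo >&3
    n=$((n+1))
  done
  while read -r item
  do
    read -r tok <&3   # blocks until a token is available
    # run the job without fd 3, then hand the token back
    { "$@" "$item" </dev/null 3>&-; echo >&3; } &
  done
  wait                # let the last jobs finish
)
```

With the curl, the echo of the output filename and the error message
moved into a function (call it fetch_one, again a made-up name), the
first loop would become something like: run_limited 8 fetch_one < myURLs
and the rest of the pipeline would stay as it is.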

How's this do you?

Cheers,
-- 
Cameron Simpson <cs@xxxxxxxxxx> DoD#743
http://www.cskk.ezoshosting.com/cs/

You can't have everything...  where would you put it?
        - Charles Robinson, cr0100@xxxxxxxxxxxxx