Re: off topic: combined output of concurrent processes

Cameron Simpson <cs@xxxxxxxxxx> · Sun, 15 Apr 2012 18:23:33 +1000

[ Potential solution at bottom of post. ]

Ok, another question: is curl using ONLY the -s (silent) option? And
specifically, is it NOT using -C (continue)?

Curl can behave differently if its output is not a terminal. Aside from
turning off the progress meter if the output _is_ a terminal, with -C it
uses the output file to figure out how much data to request; if it thinks
the file is already partially fetched it won't fetch the front part.

Finally, it is conceivable that curl might seek() the file.

The reason I suggest this is that you said using a >> made it good.
Normally (with a single output file, opened just the once) they would
behave the same. But suppose curl is, internally, seek()ing to a
particular position. With a file opened for append that does nothing (the
next write will go at the end of the file anyway), but without append the
seek would reposition the file pointer and overwrites would occur.
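
To make the difference concrete, here is a small sketch that needs no
curl at all. Plain sh cannot seek, so as a stand-in (my assumption, not
anything curl is known to do) it reopens the file with "1<>", which
starts writing at offset 0 without truncating. The file name is made up
for the demo.

```shell
f=${TMPDIR:-/tmp}/seekdemo.$$

# One open, shared by both writers: the second write continues
# where the first stopped, with ">" just as it would with ">>".
{ printf 'first\n'; printf 'second\n'; } >"$f"
before=$(cat "$f")

# Stand-in for a seek back to the start: "1<>" reopens the file
# read-write at offset 0, neither truncating nor appending.
printf 'XX' 1<>"$f"
after=$(cat "$f")

# A write through an append-mode open always lands at the end.
printf 'third\n' >>"$f"
final=$(cat "$f")

rm -f "$f"
printf '%s\n' "--- before" "$before" "--- after" "$after" "--- final" "$final"
```

The "after" copy has the start of the first line overwritten, which is
the kind of damage a repositioned write would do; the appended line is
unaffected by where anybody else left the file pointer.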

Curl's got no good reason to do that (even with a -C option), but it
might; if we suspect this, some tests using the strace command can tell
us.
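
Such a test might look roughly like this. The helper name trace_seeks
and the trace file name are invented for illustration; the strace
options used are -f (follow children and threads), -o (write the trace
to a file) and -e trace= (restrict the trace to the named system calls).
On some 32-bit platforms the seek call shows up as _llseek instead of
lseek, so treat the syscall list as an assumption about your machine.

```shell
# Hypothetical helper: run a command under strace, logging only the
# seek and write system calls to the named trace file.
trace_seeks() {
  trace=$1
  shift
  strace -f -o "$trace" -e trace=lseek,write "$@"
}

# Example use (needs strace installed and a reachable URL in $url):
#   trace_seeks curl.trace curl -s "$url" >test.out
#   grep 'lseek(1,' curl.trace
```

Any lseek on file descriptor 1 in the trace would mean curl really is
repositioning its standard output.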

However, we can work around this whole issue and solve two problems:
  - the sharing of the output file, which we _suspect_ may be triggering
    bad behaviour from curl
  - the possible interleaving of curl outputs: curl _will_ get data from
    the URL in chunks, and parallel curls will interleave their output
    chunks

Look at this (completely untested) loop:

  # a little setup
  cmd=`basename "$0"`
  : ${TMPDIR:=/tmp}
  tmppfx=$TMPDIR/$cmd.$$

  i=0
  while read -r url
  do
    i=$((i+1))
    out=$tmppfx.$i
    if curl -s "$url" >"$out"
    then  echo "$out"
    else  echo "$cmd: curl fails on: $url" >&2
          rm -f "$out"
    fi &
  done < myURLs \
  | while read -r out
    do
      cat "$out"
      rm "$out"
    done \
  | tee all-data.out \
  | your-data-parsing-program

This program does a few things:
  - gives each curl its own output file, avoiding our issues
  - runs them all in parallel, achieving your aim
  - never interleaves one curl with another;
    the second loop reads each completed output file in turn, completely
  - takes a copy using tee to the file all-data.out, just so you can
    inspect it
  - uses the "run until EOF" approach, avoiding tricky games with "wait"
  - writes output filenames using echo to the pipe in parallel;
    the echoes _should_ all do single writes into the pipe to the second
    loop, and never interfere with each other in consequence
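
If you want to watch the pattern work before pointing it at real URLs,
here is a cut-down sketch with curl swapped for a stand-in. The function
fake_fetch and the names alpha, beta and gamma are made up for the demo;
each "fetch" writes its output slowly, in two pieces, so the three jobs
genuinely overlap in time.

```shell
# Hypothetical stand-in for "curl -s URL": emits two lines, slowly.
fake_fetch() {
  for n in 1 2
  do
    echo "$1 chunk $n"
    sleep 1
  done
}

tmppfx=${TMPDIR:-/tmp}/demo.$$

result=$(
  i=0
  for name in alpha beta gamma
  do
    i=$((i+1))
    out=$tmppfx.$i
    # Each job writes to its own file, then announces the finished file.
    { fake_fetch "$name" >"$out" && echo "$out"; } &
  done \
  | while read -r out
    do
      # Only completed files arrive here, so each is copied out whole.
      cat "$out"
      rm "$out"
    done
)

printf '%s\n' "$result"
```

The order of the three groups depends on which job finishes first, but
the two lines belonging to any one name should always come out together.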

Shortcomings: if you have too many URLs you will run out of processes (or
available connections at the target web server) by running too many
curls at once, because it will read URLs and fork/exec curls as fast as
it can. If this solves your problems in a fashion pleasing to your mind,
we can move to a more advanced token based loop to keep a maximum number
of curls in play at any one time.
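
In the meantime, one cheap way to cap the number of simultaneous curls is
to let xargs do the counting. To be clear, this is not the token based
loop, just a different technique with a similar effect. It leans on the
-P option, which is not in POSIX but is supported by both GNU and BSD
xargs, and on mktemp. The function name fetch_limited is invented for
this sketch, and it is as untested against real URLs as the loop above.

```shell
# Usage: fetch_limited MAX FETCHCMD [ARGS...] < url-list
# Runs at most MAX copies of FETCHCMD at once, one URL appended to each,
# and copies out every successful result whole, one after another.
fetch_limited() {
  max=$1
  shift
  xargs -P "$max" -n 1 sh -c '
      # "$@" here is the fetch command with one URL appended by xargs.
      out=$(mktemp "${TMPDIR:-/tmp}/fetch.XXXXXX") || exit 1
      if "$@" >"$out"
      then  echo "$out"
      else  echo "fetch failed: $*" >&2
            rm -f "$out"
      fi
  ' sh "$@" \
  | while read -r out
    do
      cat "$out"
      rm "$out"
    done
}
```

You would call it something like this:

    fetch_limited 4 curl -s < myURLs | tee all-data.out | your-data-parsing-program

One caveat: by default xargs gives quotes and backslashes in its input
special treatment, so URLs containing those characters would need extra
care.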

How's this do you?

Cheers,
-- 
Cameron Simpson <cs@xxxxxxxxxx> DoD#743
http://www.cskk.ezoshosting.com/cs/

You can't have everything...  where would you put it?
        - Charles Robinson, cr0100@xxxxxxxxxxxxx