off topic: combined output of concurrent processes

Cameron Simpson cs at zip.com.au
Sun Apr 15 21:38:13 UTC 2012


On 15Apr2012 14:30, Amadeus W.M. <amadeus84 at verizon.net> wrote:
| > Look at this (completely untested) loop:
| > 
| >   # a little setup
| >   cmd=`basename "$0"`
| >   : ${TMPDIR:=/tmp}
| >   tmppfx=$TMPDIR/$cmd.$$
| > 
| >   i=0
| >   while read -r url
| >   do
| >     i=$((i+1))
| >     out=$tmppfx.$i
| >     if curl -s "$url" >"$out"
| >     then  echo "$out"
| >     else  echo "$cmd: curl fails on: $url" >&2 fi &
| >   done < myURLs \
| >   | while read -r out
| >     do
| >       cat "$out"
| >       rm "$out"
| >     done \
| >   | tee all-data.out \
| >   | your-data-parsing-program
| 
| 
| I understand the script, although I haven't tested it either. My take on 
| it: 
| 	+ it solves the problem of curls overwriting (I think) 
| 	+ the data parsing and tracking is done on the combined curls

Yes.

| 	- it retrieves the urls serially, not in parallel

No, in parallel. There is an "&" after the "fi", so each whole
if/curl/fi compound runs as a background job while the loop goes on to
read the next URL.

It looks like the "fi &" got sucked onto the end of the echo statement
somewhere in the quoting. It should be on its own line.
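
With that fixed, the loop body reads:

  if curl -s "$url" >"$out"
  then  echo "$out"
  else  echo "$cmd: curl fails on: $url" >&2
  fi &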

| 	- it writes them to disk

Just long enough to be read and catted, then removed. And because the
second loop cats each file whole, one at a time, the outputs of the
parallel curls never interleave in the merged stream.

| 	- it re-reads them from disk, hence some disk activity, although 
| probably insignificant relative to the download time.

Should be, yes. The files are short lived, and if $TMPDIR points at a
memory backed filesystem (as /tmp often does these days) the data may
never touch the physical disk at all.

| The way I'm doing it now is this: I do the retrieval and the parsing and 
| tracking all within a single program. For each url I create a separate 
| thread from which I call curl and get its output, then parse.
| Like this:
| 
| // inside each thread:
[... popen(curl...) ...]
| // when threads done, analyze the combined info.
| 
| This works, but I would have liked a more modular solution. I want the url 
| retrieval to be a separate, standalone entity and the parsing and 
| tracking another entity (possibly two entities). Hence, what I want is 
| 
| - in a shell
| 	- download in parallel 
| 	- merge curl outputs

My above loop tries to do that. The curls do run in parallel.

| then pipe into the parser/tracker. Parsing can be done per url, but 
| tracking MUST be across urls. 

That should work; your parser comes at the end of the pipeline.
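
If you want the fetching to be a standalone entity, the loop above
repackages readily. Something like this (again, completely untested;
"geturls" is just a name I've made up) reads URLs on stdin and writes
the merged data to stdout:

  #!/bin/sh
  # geturls: fetch the URLs supplied on stdin in parallel,
  # writing their merged outputs to stdout
  cmd=`basename "$0"`
  : ${TMPDIR:=/tmp}
  tmppfx=$TMPDIR/$cmd.$$

  i=0
  while read -r url
  do
    i=$((i+1))
    out=$tmppfx.$i
    # the whole if..fi runs in the background, one job per URL
    if curl -s "$url" >"$out"
    then  echo "$out"
    else  echo "$cmd: curl fails on: $url" >&2
    fi &
  done \
  | while read -r out
    do
      # cat each file whole, then discard it; outputs never interleave
      cat "$out"
      rm "$out"
    done

Then the parsing and the cross-URL tracking live in separate programs
downstream:

  geturls < myURLs | tee all-data.out | your-data-parsing-program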

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

[Alain] had been looking at his dashboard, and had not seen me, so I
ran into him. - Jean Alesi on his qualifying prang at Imola '93

