off topic: combined output of concurrent processes
Amadeus W.M.
amadeus84 at verizon.net
Sun Apr 15 14:30:27 UTC 2012
>
> Look at this (completely untested) loop:
>
> # a little setup
> cmd=`basename "$0"`
> : ${TMPDIR:=/tmp}
> tmppfx=$TMPDIR/$cmd.$$
>
> i=0
> while read -r url
> do
> i=$((i+1))
> out=$tmppfx.$i
> if curl -s "$url" >"$out"
> then echo "$out"
> else echo "$cmd: curl fails on: $url" >&2
> fi &
> done < myURLs \
> | while read -r out
> do
> cat "$out"
> rm "$out"
> done \
> | tee all-data.out \
> | your-data-parsing-program
I understand the script, although I haven't tested it either. My take on it:
+ it solves the problem of concurrent curls overwriting each other's output (I think)
+ the data parsing and tracking is done on the combined curls
- it retrieves the urls serially, not in parallel
- it writes them to disk
- it re-reads them from disk, hence some disk activity, although
probably insignificant relative to the download time.
The way I'm doing it now is this: I do the retrieval and the parsing and
tracking all within a single program. For each url I create a separate
thread from which I call curl and get its output, then parse. Like this:
// inside each thread:
const size_t bufSize_ = 1<<20; // 1 MiB, sufficiently large
char buf_[bufSize_]; // local to each thread
// form the curl command = "curl -s $url"
FILE *fd_ = popen(curl_command, "r"); // omit error checking here
size_t nRead_ = fread(buf_, sizeof(char), bufSize_, fd_);
pclose(fd_);
buf_[nRead_ < bufSize_ ? nRead_ : bufSize_ - 1] = '\0'; // fread doesn't NUL-terminate
parse(buf_); // to a struct visible from all threads
// when threads done, analyze the combined info.
This works, but I would have liked a more modular solution. I want the url
retrieval to be a separate, standalone entity and the parsing and
tracking another entity (possibly two entities). Hence, what I want is
- in a shell
- download in parallel
- merge curl outputs
then pipe into the parser/tracker; a sketch of what I mean follows below.
Parsing can be done per url, but tracking MUST be across urls.
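Something along these lines (equally untested) would fit the bill: every
curl is backgrounded, the script waits for all of them, then merges the
outputs in input order and pipes the combined stream into the parser.
The names myURLs, all-data.out and your-data-parsing-program are carried
over from the quoted script.

: ${TMPDIR:=/tmp}
tmppfx=$TMPDIR/fetch.$$

n=0
while read -r url
do
    n=$((n+1))
    curl -s "$url" >"$tmppfx.$n" &    # all curls run concurrently
done < myURLs
wait                                  # until every download has finished

j=1
while [ "$j" -le "$n" ]               # merge in input order, not glob order
do
    cat "$tmppfx.$j"
    rm -f "$tmppfx.$j"
    j=$((j+1))
done \
    | tee all-data.out \
    | your-data-parsing-program       # tracking sees all urls at once

The wait is the price of simplicity: nothing reaches the parser until the
slowest download finishes, whereas the quoted loop streams each page out
as soon as its curl exits.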