off topic: combined output of concurrent processes
Amadeus W.M.
amadeus84 at verizon.net
Sun Apr 15 14:30:27 UTC 2012
>
> Look at this (completely untested) loop:
>
> # a little setup
> cmd=`basename "$0"`
> : ${TMPDIR:=/tmp}
> tmppfx=$TMPDIR/$cmd.$$
>
> i=0
> while read -r url
> do
> i=$((i+1))
> out=$tmppfx.$i
> if curl -s "$url" >"$out"
> then echo "$out"
> else echo "$cmd: curl fails on: $url" >&2
> fi &
> done < myURLs \
> | while read -r out
> do
> cat "$out"
> rm "$out"
> done \
> | tee all-data.out \
> | your-data-parsing-program
I understand the script, although I haven't tested it either. My take on it:
+ it solves the problem of concurrent curls overwriting each other's output (I think)
+ the data parsing and tracking is done on the combined curls
- it retrieves the urls serially, not in parallel
- it writes them to disk
- it re-reads them from disk, hence some disk activity, although
probably insignificant relative to the download time.
The way I'm doing it now is this: I do the retrieval and the parsing and
tracking all within a single program. For each url I create a separate
thread from which I call curl and get its output, then parse. Like this:
// inside each thread:
const size_t bufSize_ = 1<<20; // 1 MiB, sufficiently large
char buf_[bufSize_]; // local to each thread
// form the curl command = "curl -s $url"
FILE *fd_ = popen(curl_command, "r"); // omit error checking here
size_t nRead_ = fread(buf_, sizeof(char), bufSize_, fd_);
pclose(fd_);
buf_[nRead_ < bufSize_ ? nRead_ : bufSize_ - 1] = '\0'; // fread doesn't NUL-terminate
parse(buf_); // to a struct visible from all threads
// when threads done, analyze the combined info.
This works, but I would have liked a more modular solution. I want the url
retrieval to be a separate, standalone entity and the parsing and
tracking another entity (possibly two entities). Hence, what I want is
- in a shell
- download in parallel
- merge curl outputs
then pipe into the parser/tracker; a sketch of what I mean follows below.
Parsing can be done per url, but tracking MUST be across urls.
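Something along these lines (equally untested) would fit the bill: every
curl is backgrounded, the script waits for all of them, then merges the
outputs in input order and pipes the combined stream into the parser.
The names myURLs, all-data.out and your-data-parsing-program are carried
over from the quoted script.

: ${TMPDIR:=/tmp}
tmppfx=$TMPDIR/fetch.$$

n=0
while read -r url
do
    n=$((n+1))
    curl -s "$url" >"$tmppfx.$n" &    # all curls run concurrently
done < myURLs
wait                                  # until every download has finished

j=1
while [ "$j" -le "$n" ]               # merge in input order, not glob order
do
    cat "$tmppfx.$j"
    rm -f "$tmppfx.$j"
    j=$((j+1))
done \
    | tee all-data.out \
    | your-data-parsing-program       # tracking sees all urls at once

The wait is the price of simplicity: nothing reaches the parser until the
slowest download finishes, whereas the quoted loop streams each page out
as soon as its curl exits.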