On Wed, 2 Apr 2014, Greg Woods wrote:
My experience says there isn't. Granted, I am not an expert in parallel
computing, but I work for a supercomputing site. About 15 years ago,
high-performance computing hit the wall with regard to how fast a single
processor could be. We had CRAY computers that used vector processing;
that means executing the same instruction on a range of memory words at
the same time, in one instruction cycle. This means that code like
for i = 1,100 do
    a[i] = a[i]*2
done
would execute at the same speed as "x=x*2" (in this admittedly trivial
example, you get a factor of 100 speedup). That was a lot easier to
program for than multiprocessing, but even that required careful
attention when writing code so that it would vectorize and get the
performance boost.
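The same loop-versus-vector contrast can be sketched in Python with NumPy (my example, not Greg's CRAY code): the interpreted loop does one multiply per trip, while the NumPy expression applies the multiply across the whole array in a single operation, analogous to a vector instruction.

```python
import numpy as np

# Scalar version: one multiply per loop iteration, 100 trips.
a = list(range(100))
for i in range(100):
    a[i] = a[i] * 2

# Vectorized version: the whole array is doubled in one
# expression, the way a vector unit handles a range of words.
b = np.arange(100) * 2

assert a == b.tolist()
```

As in the CRAY case, the vectorized form only helps when the loop body is actually vectorizable; a loop with data-dependent branches would not translate this way.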
After single-processor computing hit the wall, we and every other HPC
site had to go to parallel processing (modern supercomputers have tens
of thousands of processors running on thousands of separate nodes). This
too requires special coding, so that your program will naturally break
up into separate tasks that can be executed in parallel. That is true
whether you are talking about using multiple processors on a single
machine, or spreading a code over multiple systems. There are MPI
libraries to make this task easier, but it is never as simple as "OK,
now execute this unmodified code five times as fast using five machines
instead of one".
How difficult it is to parallelize the code depends, as has already been
said here, on the particular application to be parallelized.
--Greg
Right. A lot of image processing tasks are amenable to parallelization.
Consider an algorithm called "adaptive histogram equalization." What it
does is:
1) Get a pixel and a small area around it (say the surrounding 100 pixels).
2) Do a contrast enhancement method called "histogram equalization" on that
group of pixels. This will change the value of the pixel in question. Let's say that
this process involves 500 high-level instructions.
3) Move to the next pixel. Do the same thing.
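The three steps above can be sketched as a brute-force toy in Python/NumPy (this is my illustrative version, assuming an 8-bit greyscale image as a 2-D array; the window radius is made up, and real implementations add refinements like contrast limiting):

```python
import numpy as np

def equalize_window(window, levels=256):
    # Histogram equalization on one neighborhood: map each grey
    # level through the normalized cumulative histogram.
    hist = np.bincount(window.ravel(), minlength=levels)
    cdf = hist.cumsum() / window.size
    return (cdf * (levels - 1)).astype(np.uint8)

def adaptive_hist_eq(img, radius=5):
    # Steps 1-3: for every pixel, equalize over its local
    # neighborhood and keep only that pixel's remapped value.
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            lut = equalize_window(img[y0:y1, x0:x1])
            out[y, x] = lut[img[y, x]]
    return out
```

Each output pixel depends only on its own window of the *input* image, which is exactly why the two outer loops parallelize so cleanly: every pixel can be handed to a different processor with no communication between them.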
If you have a 12-megapixel image (say, 11,760,000 pixels), that's 5,880,000,000
instructions. That 500-instruction block is impossible to parallelize well. However,
each pixel is independent, so you can parallelize the work on each pixel easily. I
remember back in the 80s implementing this on a microVAX GPX II. It took about 3 hours to
do a 512x512 greyscale image by brute force. Then Henry Fuchs et al. developed the
Pixel-Planes machine, and Austin et al. implemented it on that -- it took about 4 seconds.
Even today on my laptop with an i7, a brute-force contrast-limited adaptive histogram
equalization on a 10 megapixel image takes a "go get a cup of coffee" time
period. There are, of course, shortcuts such as the Pizer-Cromartie algorithm, but they
introduce interpolation artifacts.
Of course, that's why we have GPUs now, and most of this stuff is done on the GPU
using CUDA.
Oh well, as I said, I remember back in the day trying to build a Beowulf cluster and
deciding that it just wasn't worth the effort. I was hoping that new tools were
around to make it easier, with all the new advances in cloud and virtualization, but no
such luck, it seems.
billo