os that rather uses the gpu?
hlhowell at pacbell.net
Sat Jul 17 17:08:48 UTC 2010
On Fri, 2010-07-16 at 23:04 -0400, Robert Myers wrote:
> On Fri, Jul 16, 2010 at 10:35 PM, JD <jd1008 at gmail.com> wrote:
> So, what would you say is/are the class/classes of problems
> that would
> benefit greatly from a high flops gpu, but without the sort of
> bandwidth you would like to see?
> Almost any problem that is embarrassingly parallel or nearly so is
> potentially a candidate for low-bandwidth computing. Ray tracing is
> the primo example. Almost any linear problem is potentially
> embarrassingly parallel, and, if you don't want to go through the work
> of exposing the embarrassingly parallel nature of the problem, there
> are the tricks that make the linpack benchmark so popular for selling
> "supercomputers" that have absurdly small bisection bandwidths.
> My question, though, is, if that's the kind of problem you have, why
> not do it on a distributed platform and teach students how to use
> distributed resources? If you're Pixar, I understand why you'd want a
> well-organized farm of GPU's, but if you just want to replicate what
> LLNL (Lawrence Livermore) was doing, say, ten years ago, are you doing
> your students any favor by giving them a GPGPU instead of the
> challenge of doing real distributed computing? Conceivably, watts per
> flop (power consumption) makes GPGPU's the hands down winner over
> distributed computing for problems that are embarrassingly parallel or
> nearly so. Inevitably, though, people will want clusters of GPGPU's,
> so you'll wind up doing distributed computing, anyway.
> If you rewrite your applications for the one-off architectures typical
> of GPU's, so that you have to do it all over again when the next
> generation comes out, have you really done yourself any favors?
> I don't claim that there are simple or obvious answers, but it's just
> too easy for people to be blown away by huge flops numbers. What I'm
> afraid of has already started to happen as far as I'm concerned, which
> is that all problems will be jammed into the low-bandwidth mold,
> whether it's appropriate or not.
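Robert's embarrassingly parallel case is easy to make concrete: scatter independent chunks, compute with no communication at all, and do a single gather at the end. Here is a minimal Python sketch (the function names and chunk counts are my own illustration), estimating pi by splitting a midpoint-rule quadrature of 4/(1+x^2) over [0,1] across independent sub-intervals:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_quadrature(args):
    # integrate f(x) = 4 / (1 + x^2) over [lo, hi) by the midpoint rule;
    # each chunk touches only its own slice of the domain, so no
    # communication between workers is ever needed
    lo, hi, steps = args
    h = (hi - lo) / steps
    return sum(4.0 / (1.0 + (lo + (i + 0.5) * h) ** 2)
               for i in range(steps)) * h

def parallel_pi(workers=4, steps_per_chunk=100_000):
    # scatter: one independent sub-interval per worker
    bounds = [(w / workers, (w + 1) / workers, steps_per_chunk)
              for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # gather: a single reduction at the end is the only "communication"
        return sum(pool.map(chunk_quadrature, bounds))
```

Because no chunk ever reads another chunk's data, the workers could just as well be GPU blocks or cluster nodes; the only bandwidth the problem demands is the final reduction.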
But unfortunately, Robert, networks are inherently low bandwidth. To
achieve full throughput you still need parallelism in the networks
themselves. I think, from your description, that you are discussing the
fluid models, which are generally decomposed into finite approximations
over limited regions for computation with established boundary conditions.
This partitioning permits the distributed processing approach, but suffers
at the boundaries, where the boundary-crossing phenomena are either
discounted, or simulated, or, I guess, passed via some coding algorithm to
permit recursive reduction for some number of iterations.
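That partitioning with boundary traffic can be sketched in miniature. Below is a hypothetical 1-D explicit heat-equation step split into two subdomains with one ghost cell each; the ghost refresh before every step stands in for the network exchange at the boundaries (all names here are illustrative):

```python
def step_global(u, alpha=0.25):
    # one explicit finite-difference heat step; the two end cells are
    # fixed (Dirichlet) boundary conditions
    return [u[0]] + [u[i] + alpha * (u[i-1] - 2.0*u[i] + u[i+1])
                     for i in range(1, len(u) - 1)] + [u[-1]]

def step_partitioned(u, cut, alpha=0.25):
    # split the domain at `cut`: each half gets one ghost cell copied
    # from its neighbour before the update -- this copy is the
    # boundary-crossing traffic a cluster would send over the network
    left  = u[:cut] + [u[cut]]       # local cells + right ghost
    right = [u[cut - 1]] + u[cut:]   # left ghost + local cells
    # each half updates its interior independently, then drops its ghost
    return step_global(left, alpha)[:-1] + step_global(right, alpha)[1:]
```

Because the ghosts are refreshed every step, the two halves reproduce the single-domain result exactly; skipping or approximating that exchange is exactly where the boundary errors creep in.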
What your field, and most others related to fluid dynamics, such as
plasma studies, explosion studies and so on, needs is full wide-band
memory access across thousands of processors (millions, perhaps?).
I don't pretend to understand all the implications of the various
computational requirements of your field, or of neuroprocessors, which
is another area where massive parallelism requires deep memory access;
my own problem area is limited to data spaces of only a few gigabytes,
which serial processing is generally capable of handling, although not
in real time.
There are a number of neat processing ideas that have applications to
specific parallel type problems, such as CAPP, SIMD, MIMD arrays, and
neural networks. Whether these solutions will match your problem,
likely depends a great deal on your view of the problem. As in most
endeavors in life, our vision of the solution is, as you say of the
others here, limited by our view of the problem.
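For concreteness, the SIMD/MIMD distinction reduces to this toy sketch (simd and mimd are illustrative names of mine, not any real API):

```python
def simd(op, lanes):
    # SIMD: a single instruction stream applied in lockstep
    # across every data lane
    return [op(x) for x in lanes]

def mimd(programs, data):
    # MIMD: each processing element runs its own instruction
    # stream on its own data
    return [program(x) for program, x in zip(programs, data)]
```

For example, `simd(lambda x: x * x, [1, 2, 3, 4])` squares every lane with one "instruction", while `mimd` lets each element run a different program entirely.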
Very few people outside the realm of computation analysis ever deal with
the choices of algorithms, architecture, real throughput, processing
limitations, bandwidth limitations, data access and distribution
problems, and so on. Fewer still deal with massive quantities of data.
Search engines deal with some of these issues; companies like Google deal
with all kinds of distributed problems across data spaces that dwarf the
efforts of even most sciences. Some algorithms, as you point out, have
limitations of the Multiple Instruction, Multiple Data sort, which place
great demands on memory bandwidth and processor speed.
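One way to see the memory-bandwidth demand is the roofline-style bound: a kernel's attainable rate is the smaller of the compute peak and the memory bandwidth times the kernel's arithmetic intensity. A sketch, with the machine numbers purely illustrative:

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    # a kernel can never run faster than the compute peak, nor faster
    # than memory can feed it: bandwidth (GB/s) * intensity (flop/byte)
    return min(peak_gflops, bandwidth_gbs * intensity)

# double-precision dot product: 2 flops per element, 16 bytes loaded
# per element -> 0.125 flop/byte
dot_intensity = 2.0 / 16.0

# illustrative machine: 1000 Gflop/s peak, 100 GB/s memory bandwidth;
# the dot product is capped at 12.5 Gflop/s, 1.25% of peak
rate = attainable_gflops(dot_intensity, 1000.0, 100.0)
```

Huge flops numbers are thus meaningless for low-intensity kernels; the memory system, not the ALUs, sets the ceiling.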
But saying that a particular architecture is unfit for an application
means that you have to understand both the application, and the
architecture. These are both difficult today, as the architectures are
changing about every 11 months or maybe less right now. Computation via
interferometry, for example, is one of the old (new) fields where a known
capability is only now becoming practical to explore. Optical computing,
3d shading, broadcast 3d, and other immersion technologies add new
requirements to the GPGPU's being discussed. Motion coupled with 3d
means that the shaders and other elements need greater processing power
and greater throughput. Their architecture is undergoing severe
redesign. Even microcontrollers are expanding their reach via multiple
processors (via things like the propeller chip for example).
I am a test applications consultant. My trade forces me to
continuously update my skills and try to keep up with the multitude of
new architectures. I have almost no free time, as researching new
architectures, designing new algorithms, understanding the application
of algorithms to new tests, and hardware design requirements eats up
many hours every day. Fortunately I am and always have been a NERD and
proud of it. I work at it.
Since you have a deep knowledge of your requirements, perhaps you should
put some time into thinking of a design methodology other than those I
have mentioned or those that you know, in an attempt to improve the
science. I am sure many others would be very appreciative.