Fixing the glibc adobe flash incompatibility

Thu Nov 18 15:23:56 UTC 2010

On Thu, Nov 18, 2010 at 10:09:56AM -0500, Genes MailLists wrote:
> On 11/18/2010 09:28 AM, Jakub Jelinek wrote:
> >> Downside: nothing.
> > 
> > Downside: slower memcpy on sse4.2 machines
> 
>   Do you know how much slower in absolute time is it?
> 
>   And is it (or would it be) visible (1/10's of seconds) or invisible
> (ms) in some typical (or atypical) apps that call memcpy() ... ?

Depends on the application, but memcpy is certainly one of the most
performance-critical functions.  It is used basically everywhere, and heavily
so; I very often see it very high in oprofile dumps etc.  For memcpy,
performance with very small lengths is critical (most programs call memcpy
with small lengths), but many apps also copy large memory blocks around (which
is where SSE*, non-temporal stores etc. play a role).  The latter measurably
shows up on SPEC2k/SPEC2k6, for example.
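
To get a rough feel for both cases on a given machine, something like the
following could be used.  This is only a minimal sketch, not how glibc or the
SPEC suites measure anything: memcpy_ns_per_call is a made-up helper name, it
uses the coarse standard clock() timer, and it ignores cache and alignment
effects entirely.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of glibc): average CPU time in nanoseconds
   for one memcpy() of `len` bytes, measured over `iters` calls.
   Returns -1.0 if iters is 0 or if allocation fails. */
static double memcpy_ns_per_call(size_t len, unsigned long iters)
{
    /* Call through a volatile function pointer so the compiler can
       neither inline the copy nor drop it as dead code. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    size_t alloc = len ? len : 1;
    unsigned char *src, *dst;
    clock_t start, end;
    unsigned long i;

    if (iters == 0)
        return -1.0;

    src = malloc(alloc);
    dst = malloc(alloc);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0x5a, alloc);
    memset(dst, 0, alloc);

    start = clock();
    for (i = 0; i < iters; i++)
        copy(dst, src, len);
    end = clock();

    free(src);
    free(dst);
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

Comparing e.g. memcpy_ns_per_call(16, 10000000UL) with
memcpy_ns_per_call((size_t)64 << 20, 100UL) gives one number for the
small-length case and one for the large-block case; the absolute values
depend heavily on the CPU and on which memcpy variant glibc selects.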

It is very sad that Intel/AMD didn't make sure rep movsb is the fastest
copying sequence on all of their CPUs.  Underneath, the CPU could do whatever
magic it wants based on size and src/dst alignment (e.g. handle small lengths
in hardware so they are as quick as possible, and perhaps handle larger sizes
in microcode).  rep movsb can be easily inlined and is quite short as well.
But on many CPUs, especially recent ones, it performs very badly compared to
these much larger SSE*-optimized routines.
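
To show how short such an inlined copy is, here is a minimal sketch.  It is
not the glibc code: copy_rep_movsb is a made-up name, the asm syntax assumes
GCC or Clang on x86 or x86-64, and on any other target the sketch simply
falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a glibc function): copy n bytes from src to dst
   with a single `rep movsb`.  Returns dst, like memcpy.  The buffers must
   not overlap. */
static inline void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *ret = dst;
    /* rep movsb copies (r/e)cx bytes from [(r/e)si] to [(r/e)di] and
       advances all three registers, so they are read-write operands.
       The ABI guarantees the direction flag is clear on function entry,
       so the copy runs forward. */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
#else
    /* Assumption: no rep movsb on this target, use the library routine. */
    return memcpy(dst, src, n);
#endif
}
```

The whole copy is one two-byte instruction plus register setup, which is why
it inlines so easily compared to a multi-kilobyte SSE* routine.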

If you want exact numbers, best ask Intel folks who wrote and tuned the
SSE4.2 memcpy routine.

	Jakub