On Thu, Nov 18, 2010 at 10:09:56AM -0500, Genes MailLists wrote:
> On 11/18/2010 09:28 AM, Jakub Jelinek wrote:
> >> Downside: nothing.
> >
> > Downside: slower memcpy on sse4.2 machines
>
>   Do you know how much slower in absolute time is it?
>
>   And is it (or would it be) visible (1/10's of seconds) or invisible
> (ms) in some typical (or atypical) apps that call memcpy() ... ?

Depends on the application, but memcpy is certainly one of the most
performance-critical functions. It is used basically everywhere, and
heavily so; I very often see it very high in oprofile dumps etc.

For memcpy, performance with very small lengths is critical (most
programs call memcpy with small lengths), but many apps also copy large
memory blocks around (which is where SSE*, non-temporal stores etc. play
a role). E.g. the latter measurably shows up on SPEC2k/SPEC2k6.

It is very sad that Intel/AMD didn't make sure rep movsb is the fastest
copying sequence on all of their CPUs, which underneath could do
whatever magic based on size and src/dst alignment (e.g. for small
lengths handle it in hw so it is as quick as possible, for larger sizes
perhaps handle it in microcode) - rep movsb can be easily inlined and is
quite short as well. But on many CPUs, especially recent ones, it
performs very badly compared to these much larger SSE*-optimized
routines.

If you want exact numbers, best ask the Intel folks who wrote and tuned
the SSE4.2 memcpy routine.

	Jakub
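
To illustrate the point about rep movsb being short and easy to inline,
here is a minimal sketch of such a copy using GCC-style extended asm.
The helper name copy_rep_movsb is hypothetical and this is not the glibc
implementation; the byte-loop fallback is only an assumption added so
the sketch also builds on non-x86 targets. It says nothing about speed,
which depends on the CPU as described above.

```c
#include <stddef.h>

/* Hypothetical helper (not glibc code): copy n bytes from src to dst.
   On x86 with a GCC-compatible compiler this is a single rep movsb;
   elsewhere it falls back to a plain byte loop. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* rep movsb copies (R/E)CX bytes from [(R/E)SI] to [(R/E)DI].
       All three registers are updated by the instruction, hence the
       read-write "+" constraints. The ABI guarantees the direction
       flag is clear on function entry, so the copy runs forward. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```

The whole body is one instruction plus register setup, which is why it
inlines so cheaply compared to a size- and alignment-dispatching SSE
routine.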