> -----Original Message-----
> From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On
> Behalf Of Jens Axboe
> Sent: Thursday, January 18, 2018 11:47 AM
> To: Rebecca Cran <rebecca@xxxxxxxxxxxx>; fio@xxxxxxxxxxxxxxx
> Subject: Re: memcpy test: results from adding sse and avx tests
>
> On 1/18/18 10:38 AM, Rebecca Cran wrote:
> > I added code to lib/memcpy.c to test sse and avx performance, and found
> > that on modern systems memcpy outperforms both by quite some margin
> > (GB/s) on the larger block sizes: the only place sse/avx is an
> > improvement was on an older SandyBridge EP system - I've copied the
> > output below.
> >
> > Should I work on a patch to commit the changes, or just abandon them
> > since it seems the current memcpy implementation used in the mmap engine
> > is the best solution on modern machines?
>
> The upside would be having an implementation that is independent of
> the OS, the downside is the (significant) extra maintenance burden
> and the differing results on different machines.
>
> The synthetic test case is a bit misleading, I think. avx/sse might
> yield great results for small sizes, but in actual workloads, having
> to save/restore state across context switches will add overhead. The
> simple throughput test case doesn't include that.
>
> Adding the memcpy for avx/sse to the test case might be interesting
> though, just to be able to compare performances with builtin
> memcpy/memmove on a given system.

glibc has evolved to choose those code paths only for transfers larger
than some fraction of the L3 cache size (the exact formula has changed
over time), using non-temporal stores to avoid filling the cache with
the new data.

Example results on a system with a 45 MiB L3 cache:

memcpy
   4 bytes:      1217.99 MiB/s
   8 bytes:      2414.37 MiB/s
  16 bytes:      3185.52 MiB/s
  32 bytes:      3445.46 MiB/s
  64 bytes:      3687.72 MiB/s
  96 bytes:      3873.44 MiB/s
 128 bytes:      4200.43 MiB/s
 256 bytes:      4115.74 MiB/s
 512 bytes:      4348.14 MiB/s
   2 KiB:        4295.60 MiB/s
   4 KiB:        4328.30 MiB/s
   8 KiB:        3238.48 MiB/s
 128 KiB:        3628.87 MiB/s
 256 KiB:        3392.96 MiB/s
 512 KiB:        3638.49 MiB/s
   8 MiB:        3596.26 MiB/s
 6x 1.375 MiB:   3639.94 MiB/s
   9 MiB:        3635.08 MiB/s
  16 MiB:        3637.19 MiB/s
  17 MiB:        3603.30 MiB/s
  22 MiB:        3414.47 MiB/s
  32 MiB:        3639.77 MiB/s
  40 MiB:        6528.04 MiB/s
  48 MiB:        7258.69 MiB/s
 128 MiB:        7271.92 MiB/s
 full buffer:    7260.11 MiB/s

glibc-2.26 added support for a GLIBC_TUNABLES environment variable that
can be used to adjust the non-temporal threshold for some of the library
functions like memcpy().
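For reference, the non-temporal path comes down to streaming stores that
bypass the caches. A minimal SSE2 sketch of the technique (not glibc's
actual implementation; it assumes a 16-byte-aligned destination, a length
that is a multiple of 16, and skips the head/tail handling a real memcpy
needs) looks roughly like this:

#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>

/* Copy len bytes using non-temporal (streaming) stores so the
 * destination data does not displace existing cache contents. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;

        for (size_t i = 0; i < len / 16; i++) {
                __m128i v = _mm_loadu_si128(&s[i]);  /* normal cached load */
                _mm_stream_si128(&d[i], v);          /* non-temporal store */
        }
        _mm_sfence();  /* streaming stores are weakly ordered; fence
                          before anything that depends on the data */
}

The closing sfence matters because the MOVNTDQ stores generated by
_mm_stream_si128() are weakly ordered with respect to other stores.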
The GLIBC_TUNABLES syntax looks like:

GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=2097152

or

GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=131072:glibc.tune.hwcaps=AVX2_Usable,ERMS,-Prefer_No_VZEROUPPER,AVX_Fast_Unaligned_Load

Example results with the threshold set very low (4 KiB) show performance
continuing to scale with transfer size:

export GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=4096
numactl -C 1 -m 0 ./fio --memcpytest

memcpy
   4 bytes:      1210.09 MiB/s
   8 bytes:      2413.12 MiB/s
  16 bytes:      3192.43 MiB/s
  32 bytes:      3471.57 MiB/s
  64 bytes:      3577.36 MiB/s
  96 bytes:      3897.51 MiB/s
 128 bytes:      4035.04 MiB/s
 256 bytes:      4263.93 MiB/s
 512 bytes:      4276.74 MiB/s
   2 KiB:        4197.32 MiB/s
   4 KiB:        4227.21 MiB/s
   8 KiB:        5360.31 MiB/s
 128 KiB:        6903.83 MiB/s
 256 KiB:        6972.99 MiB/s
 512 KiB:        7000.79 MiB/s
   8 MiB:        7024.60 MiB/s
 6x 1.375 MiB:   7018.43 MiB/s
   9 MiB:        7023.53 MiB/s
  16 MiB:        7031.22 MiB/s
  17 MiB:        7022.85 MiB/s
  22 MiB:        7017.37 MiB/s
  32 MiB:        7035.03 MiB/s
  40 MiB:        7027.72 MiB/s
  48 MiB:        7021.17 MiB/s
 128 MiB:        7038.09 MiB/s
 full buffer:    7022.51 MiB/s

I'm working on a patch set to run these (and more) memory tests within
an mmap()ed region, which is especially useful for persistent memory
that doesn't have the same performance characteristics as regular
memory. I'll try to post an RFC of that series soon. (A rough sketch of
the mmap() timing idea is below my signature.)

---
Robert Elliott, HPE Persistent Memory
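For illustration only - this is not the fio patch set, and the file path
is a placeholder for something like a file on a DAX-mounted persistent
memory filesystem - a standalone "time memcpy() into an mmap()ed region"
test could look like this:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE (64UL << 20)   /* 64 MiB mapping */
#define ITERS    16

int main(void)
{
        /* placeholder path - a pmem setup would point at a DAX mount */
        const char *path = "/mnt/pmem/memcpy-test";
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, BUF_SIZE) < 0) {
                perror("open/ftruncate");
                return 1;
        }

        void *dst = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (dst == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        void *src = malloc(BUF_SIZE);
        if (!src)
                return 1;
        memset(src, 0x5a, BUF_SIZE);    /* fault in the source buffer */
        memset(dst, 0, BUF_SIZE);       /* fault in the mapping */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
                memcpy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f MiB/s\n", ITERS * (BUF_SIZE / 1048576.0) / secs);

        munmap(dst, BUF_SIZE);
        close(fd);
        free(src);
        return 0;
}

Built with plain "gcc -O2", it reports a single aggregate bandwidth
number for one transfer size; a real test would sweep block sizes the
way the output above does.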