RE: memcpy test

> -----Original Message-----
> From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On
> Behalf Of Jens Axboe
> Basically it just copies between two 32MB chunks, using whatever
> implementation you would like, and in increments of some defined size. 
> This is what it spits out on my laptop:
> >>
> >> memcpy
> >> 	8 bytes:	 3360.94 MiB/sec
> >> 	16 bytes:	 4363.47 MiB/sec
> >> 	96 bytes:	 6804.46 MiB/sec
> >> 	128 bytes:	 6391.39 MiB/sec
> >> 	256 bytes:	 6571.09 MiB/sec
> >> 	512 bytes:	 6962.77 MiB/sec
> >> 	2048 bytes:	 6212.73 MiB/sec
> >> 	8192 bytes:	 6465.14 MiB/sec
> >> 	131072 bytes:	 6412.24 MiB/sec
> >> 	262144 bytes:	 6607.03 MiB/sec
> >> 	524288 bytes:	 6372.90 MiB/sec
> >> memmove
> >> 	8 bytes:	 2503.90 MiB/sec
> >> 	16 bytes:	 4311.81 MiB/sec
> >> 	96 bytes:	 6734.74 MiB/sec
> >> 	128 bytes:	 6080.16 MiB/sec
> >> 	256 bytes:	 6162.92 MiB/sec
> >> 	512 bytes:	 7309.80 MiB/sec
> >> 	2048 bytes:	 6931.94 MiB/sec
> >> 	8192 bytes:	 6878.97 MiB/sec
> >> 	131072 bytes:	 6787.05 MiB/sec
> >> 	262144 bytes:	 6877.77 MiB/sec
> >> 	524288 bytes:	 6695.26 MiB/sec
> >> simple
> >> 	8 bytes:	 1813.59 MiB/sec
> >> 	16 bytes:	 2191.63 MiB/sec
> >> 	96 bytes:	 7360.76 MiB/sec
> >> 	128 bytes:	 7192.63 MiB/sec
> >> 	256 bytes:	 7340.00 MiB/sec
> >> 	512 bytes:	 7158.04 MiB/sec
> >> 	2048 bytes:	 7495.96 MiB/sec
> >> 	8192 bytes:	 7315.30 MiB/sec
> >> 	131072 bytes:	 7565.82 MiB/sec
> >> 	262144 bytes:	 7410.95 MiB/sec
> >> 	524288 bytes:	 7537.09 MiB/sec
> >>
> >> which is kind of depressing, since the fastest for larger sizes is
> >> the very dumb and basic implementation that you'll find in any
> >> text book under the section of "my first memcpy".

I added a few more test functions locally. All the ones based on the
STREAM benchmark use double rather than char.
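For anyone who wants to reproduce this, the added kernels can be sketched roughly as below. This is a minimal sketch and not the code actually added to fio: the function names and signatures are made up for illustration, and the loop bodies follow the one-line descriptions printed next to each test name in the results (streamcopy, streamscale, streamadd, streamtriad, memcsum).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the real fio test functions are not shown in this thread. */

/* streamcopy: *d++ = *s++ (one read, one write) */
void stream_copy(double *d, const double *s, size_t n)
{
	assert(d != NULL && s != NULL);
	for (size_t i = 0; i < n; i++)
		d[i] = s[i];
}

/* streamscale: *d++ = scalar * *s++ */
void stream_scale(double *d, const double *s, double scalar, size_t n)
{
	assert(d != NULL && s != NULL);
	for (size_t i = 0; i < n; i++)
		d[i] = scalar * s[i];
}

/* streamadd: *d++ = *s++ + *s2++ (two reads, one write) */
void stream_add(double *d, const double *s, const double *s2, size_t n)
{
	assert(d != NULL && s != NULL && s2 != NULL);
	for (size_t i = 0; i < n; i++)
		d[i] = s[i] + s2[i];
}

/* streamtriad: *d++ = *s++ + scalar * *s2++ (two reads, one write) */
void stream_triad(double *d, const double *s, const double *s2,
		  double scalar, size_t n)
{
	assert(d != NULL && s != NULL && s2 != NULL);
	for (size_t i = 0; i < n; i++)
		d[i] = s[i] + scalar * s2[i];
}

/* memcsum: read-only pass, csum += *s++ over uint64_t words */
uint64_t mem_csum(const uint64_t *s, size_t n)
{
	uint64_t csum = 0;

	assert(s != NULL);
	for (size_t i = 0; i < n; i++)
		csum += s[i];
	return csum;
}
```

Which instructions these loops turn into depends on the compiler and flags, so the same source may not give the same numbers elsewhere.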

For comparison, I set the buffer size to 2x the L3 cache size of
my Broadwell-based system (135 MiB), so the cache should drop out
of the picture. At that size, the library versions of memcpy and
memmove are about 50% faster than simplememcpy, the version that
gcc compiles directly into fio (7000 MiB/s vs. 4700 MiB/s).
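To make the measurement itself concrete, here is a rough sketch of how a buffer can be copied in fixed-size increments and the rate reported in MiB/sec. It is an assumption about the general shape of such a test, not fio's actual implementation: the function name, its parameters, and the way the tail is handled are all illustrative.

```c
#define _POSIX_C_SOURCE 200809L	/* clock_gettime() under strict -std=c11 */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy len bytes from src to dst in increments of
 * chunk bytes, repeat iters times, and return the throughput in MiB/sec.
 */
double chunked_copy_mib_per_sec(void *dst, const void *src, size_t len,
				size_t chunk, unsigned int iters)
{
	struct timespec t0, t1;
	char *d = dst;
	const char *s = src;

	assert(chunk > 0 && chunk <= len && iters > 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned int it = 0; it < iters; it++) {
		size_t off = 0;

		/* Full-size increments first. */
		for (; off + chunk <= len; off += chunk)
			memcpy(d + off, s + off, chunk);
		/* Then whatever is left when chunk does not divide len. */
		if (off < len)
			memcpy(d + off, s + off, len - off);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (double)(t1.tv_sec - t0.tv_sec) +
		      (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	double mib = (double)len * iters / (1024.0 * 1024.0);

	return secs > 0.0 ? mib / secs : 0.0;
}
```

With len set to about twice the L3 size, each pass should mostly miss the cache, which is the point of the 135 MiB setting. Pinning the CPU and memory node, as the numactl invocation in the results does, keeps NUMA effects out of the numbers.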

Surprisingly, the compiler does use vmovdq with 32-byte ymm
registers for both the loads and stores, like the library;
unlike the library, it doesn't use prefetch or nontemporal
stores.
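As an illustration of what prefetch plus nontemporal stores looks like at the source level, here is a small sketch using SSE2 intrinsics. It is not what glibc does internally: the library routine named below uses ymm registers and its own thresholds, while this sketch uses 16-byte xmm registers so that it builds on any x86-64 compiler without extra flags, and falls back to plain memcpy elsewhere. The function name and the 512-byte prefetch distance are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_loadu_si128, _mm_stream_si128 */
#include <xmmintrin.h>	/* _mm_prefetch, _mm_sfence */
#endif

/* Hypothetical helper: copy with prefetch and streaming (nontemporal) stores. */
void copy_nontemporal(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	char *d = dst;
	const char *s = src;
	size_t i = 0;

	/* Streaming stores require a 16-byte-aligned destination. */
	assert(((uintptr_t)dst & 15) == 0);

	for (; i + 64 <= len; i += 64) {
		/* Hint the source a few cache lines ahead, staying in bounds. */
		if (i + 512 < len)
			_mm_prefetch(s + i + 512, _MM_HINT_NTA);

		for (size_t k = 0; k < 64; k += 16) {
			__m128i v = _mm_loadu_si128((const __m128i *)(s + i + k));

			/* Write around the cache instead of through it. */
			_mm_stream_si128((__m128i *)(d + i + k), v);
		}
	}
	/* Order the streaming stores before any later stores. */
	_mm_sfence();

	/* Copy the remaining tail (fewer than 64 bytes) normally. */
	memcpy(d + i, s + i, len - i);
#else
	/* No SSE2 available: plain copy, no nontemporal hint. */
	memcpy(dst, src, len);
#endif
}
```

Streaming stores tend to help only when the buffer is much larger than the cache, since they avoid reading the destination lines in before overwriting them; for small copies they can easily be slower.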

Here are the instructions used for the 135 MiB buffer size:
      memcpy: prefetch, vmovdq, vmovnt (ymm registers; uses __memmove_avx_unaligned_erms)
     memmove: prefetch, vmovdq, vmovnt (ymm registers; uses __memmove_avx_unaligned_erms)
simplememcpy: vmovdq, vmovdq (ymm registers)
     memcsum: mov, add, mov
      memset: rep stos (uses __memset_avx2_erms) (36% in queued_spin_lock_slowpath)
  streamcopy: vmovap, vmovap (ymm registers)
 streamscale: vmulpd, vmovup (ymm registers)
   streamadd: vmovup, vaddpd, vmovup (ymm registers)
 streamtriad: vmovup, vfmadd, vmovup (ymm registers)

For memset, the kernel is spending a lot of time in other
functions: 36% in queued_spin_lock_slowpath and a huge
percentage in [other] with just a hex address. This is with
the brand new 4.15-rc2 kernel.

Results:
$ numactl -C 0 -m 0 ./fio --memcpytest
memcpy compile-time options: BUF_SIZE=135 MiB, NR_INTERS=64
memcpy
        8 bytes:         2530.90 MiB/sec
        16 bytes:        3227.57 MiB/sec
        96 bytes:        3858.30 MiB/sec
        128 bytes:       3953.39 MiB/sec
        256 bytes:       4139.20 MiB/sec
        512 bytes:       4006.96 MiB/sec
        2048 bytes:      3889.17 MiB/sec
        8192 bytes:      3617.86 MiB/sec
        131072 bytes:    3660.72 MiB/sec
        262144 bytes:    3667.36 MiB/sec
        524288 bytes:    3671.33 MiB/sec
        135 MiB:         7027.01 MiB/sec
memmove
        8 bytes:         2535.82 MiB/sec
        16 bytes:        3220.63 MiB/sec
        96 bytes:        3875.11 MiB/sec
        128 bytes:       3953.44 MiB/sec
        256 bytes:       4147.88 MiB/sec
        512 bytes:       4000.41 MiB/sec
        2048 bytes:      3901.79 MiB/sec
        8192 bytes:      3619.32 MiB/sec
        131072 bytes:    3655.06 MiB/sec
        262144 bytes:    3695.23 MiB/sec
        524288 bytes:    3661.52 MiB/sec
        135 MiB:         7065.83 MiB/sec
simplememcpy
        8 bytes:         1812.23 MiB/sec
        16 bytes:        2123.49 MiB/sec
        96 bytes:        1736.55 MiB/sec
        128 bytes:       2215.37 MiB/sec
        256 bytes:       3883.88 MiB/sec
        512 bytes:       4572.59 MiB/sec
        2048 bytes:      4498.80 MiB/sec
        8192 bytes:      4653.40 MiB/sec
        131072 bytes:    4765.23 MiB/sec
        262144 bytes:    4767.77 MiB/sec
        524288 bytes:    4777.56 MiB/sec
        135 MiB:         4761.85 MiB/sec
memset (write-only: register-to-memory write)
        8 bytes:         2339.57 MiB/sec
        16 bytes:        3612.22 MiB/sec
        96 bytes:        4469.97 MiB/sec
        128 bytes:       4735.25 MiB/sec
        256 bytes:       4103.96 MiB/sec
        512 bytes:       3792.41 MiB/sec
        2048 bytes:      3910.08 MiB/sec
        8192 bytes:      4151.58 MiB/sec
        131072 bytes:    4731.21 MiB/sec
        262144 bytes:    4751.02 MiB/sec
        524288 bytes:    4758.88 MiB/sec
        135 MiB:         4772.74 MiB/sec
memcsum (read-only; csum += *s++, using uint64_t)
        8 bytes:         3587.10 MiB/sec
        16 bytes:        3581.16 MiB/sec
        96 bytes:        3601.63 MiB/sec
        128 bytes:       3604.49 MiB/sec
        256 bytes:       3603.76 MiB/sec
        512 bytes:       3610.62 MiB/sec
        2048 bytes:      3597.99 MiB/sec
        8192 bytes:      3595.70 MiB/sec
        131072 bytes:    3599.04 MiB/sec
        262144 bytes:    3596.95 MiB/sec
        524288 bytes:    3598.67 MiB/sec
        135 MiB:         3596.81 MiB/sec
streamcopy (*d++ = *s++)
        8 bytes:         4532.95 MiB/sec
        16 bytes:        4602.78 MiB/sec
        96 bytes:        4745.72 MiB/sec
        128 bytes:       4705.28 MiB/sec
        256 bytes:       4749.00 MiB/sec
        512 bytes:       4735.48 MiB/sec
        2048 bytes:      4759.82 MiB/sec
        8192 bytes:      4770.35 MiB/sec
        131072 bytes:    4764.58 MiB/sec
        262144 bytes:    4771.19 MiB/sec
        524288 bytes:    4763.98 MiB/sec
        135 MiB:         4764.52 MiB/sec
streamscale (*d++ = scalar * *s++)
        8 bytes:         4523.42 MiB/sec
        16 bytes:        4479.40 MiB/sec
        96 bytes:        4692.95 MiB/sec
        128 bytes:       4608.53 MiB/sec
        256 bytes:       4626.64 MiB/sec
        512 bytes:       4646.09 MiB/sec
        2048 bytes:      4662.36 MiB/sec
        8192 bytes:      4667.80 MiB/sec
        131072 bytes:    4667.98 MiB/sec
        262144 bytes:    4665.11 MiB/sec
        524288 bytes:    4671.10 MiB/sec
        135 MiB:         4671.41 MiB/sec
streamadd (two reads, one write: *d++ = *s++ + *s2++)
        8 bytes:         1909.03 MiB/sec
        16 bytes:        2686.48 MiB/sec
        96 bytes:        3034.89 MiB/sec
        128 bytes:       3226.04 MiB/sec
        256 bytes:       3175.74 MiB/sec
        512 bytes:       3190.69 MiB/sec
        2048 bytes:      3310.24 MiB/sec
        8192 bytes:      3344.29 MiB/sec
        131072 bytes:    3376.11 MiB/sec
        262144 bytes:    3388.70 MiB/sec
        524288 bytes:    3388.09 MiB/sec
        135 MiB:         3394.96 MiB/sec
streamtriad (two reads, one write: *d++ = *s++ + scalar * *s2++)
        8 bytes:         1793.84 MiB/sec
        16 bytes:        2406.70 MiB/sec
        96 bytes:        3031.67 MiB/sec
        128 bytes:       3226.22 MiB/sec
        256 bytes:       3156.21 MiB/sec
        512 bytes:       3183.24 MiB/sec
        2048 bytes:      3307.65 MiB/sec
        8192 bytes:      3336.36 MiB/sec
        131072 bytes:    3369.90 MiB/sec
        262144 bytes:    3371.26 MiB/sec
        524288 bytes:    3375.22 MiB/sec
        135 MiB:         3382.46 MiB/sec



