> -----Original Message-----
> From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On
> Behalf Of Jens Axboe
> Basically it just copies between two 32MB chunks, using whatever
> implementation you would like, and in increments of some defined size.
> This is what it spits out on my laptop:
> >>
> >> memcpy
> >>        8 bytes:   3360.94 MiB/sec
> >>       16 bytes:   4363.47 MiB/sec
> >>       96 bytes:   6804.46 MiB/sec
> >>      128 bytes:   6391.39 MiB/sec
> >>      256 bytes:   6571.09 MiB/sec
> >>      512 bytes:   6962.77 MiB/sec
> >>     2048 bytes:   6212.73 MiB/sec
> >>     8192 bytes:   6465.14 MiB/sec
> >>   131072 bytes:   6412.24 MiB/sec
> >>   262144 bytes:   6607.03 MiB/sec
> >>   524288 bytes:   6372.90 MiB/sec
> >> memmove
> >>        8 bytes:   2503.90 MiB/sec
> >>       16 bytes:   4311.81 MiB/sec
> >>       96 bytes:   6734.74 MiB/sec
> >>      128 bytes:   6080.16 MiB/sec
> >>      256 bytes:   6162.92 MiB/sec
> >>      512 bytes:   7309.80 MiB/sec
> >>     2048 bytes:   6931.94 MiB/sec
> >>     8192 bytes:   6878.97 MiB/sec
> >>   131072 bytes:   6787.05 MiB/sec
> >>   262144 bytes:   6877.77 MiB/sec
> >>   524288 bytes:   6695.26 MiB/sec
> >> simple
> >>        8 bytes:   1813.59 MiB/sec
> >>       16 bytes:   2191.63 MiB/sec
> >>       96 bytes:   7360.76 MiB/sec
> >>      128 bytes:   7192.63 MiB/sec
> >>      256 bytes:   7340.00 MiB/sec
> >>      512 bytes:   7158.04 MiB/sec
> >>     2048 bytes:   7495.96 MiB/sec
> >>     8192 bytes:   7315.30 MiB/sec
> >>   131072 bytes:   7565.82 MiB/sec
> >>   262144 bytes:   7410.95 MiB/sec
> >>   524288 bytes:   7537.09 MiB/sec
> >>
> >> which is kind of depressing, since the fastest for larger sizes is
> >> the very dumb and basic implementation that you'll find in any
> >> text book under the section of "my first memcpy".

I added a few more test functions locally. All the ones based on the
STREAM benchmark use double rather than char. For comparison, I set
the buffer size to 2x the L3 cache size of my Broadwell-based system,
so the cache should drop out of the picture.
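For anyone who wants to reproduce the shape of the test outside fio, here is a minimal sketch of the loop described above: copy one buffer into another in fixed-size increments, repeat a number of times, and report MiB/sec. It is not fio's actual code. The names copy_fn, simple_copy and bench_copy are made up for illustration, and using clock_gettime(CLOCK_MONOTONIC) for timing is my assumption.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std=c11 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* The "my first memcpy" byte loop; a stand-in for the "simple" test. */
static void *simple_copy(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/*
 * Copy from src to dst in chunk-sized steps across buf_size bytes,
 * nr_iters times over, and return the throughput in MiB/sec.
 * A trailing partial chunk (buf_size % chunk bytes) is not copied.
 * Returns 0.0 if the elapsed time is too small to measure.
 */
static double bench_copy(copy_fn fn, void *dst, const void *src,
			 size_t buf_size, size_t chunk, unsigned int nr_iters)
{
	struct timespec t0, t1;
	double secs, bytes;
	unsigned int i;
	size_t off;

	assert(chunk > 0 && chunk <= buf_size);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nr_iters; i++)
		for (off = 0; off + chunk <= buf_size; off += chunk)
			fn((char *)dst + off, (const char *)src + off, chunk);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		return 0.0;

	/* Only whole chunks were copied on each pass. */
	bytes = (double)nr_iters * (double)((buf_size / chunk) * chunk);
	return bytes / (1024.0 * 1024.0) / secs;
}
```

Calling bench_copy(memcpy, dst, src, buf_size, 8192, 64) exercises the library routine, and passing simple_copy instead exercises whatever the compiler makes of the byte loop. The numbers depend heavily on compiler flags and on whether buf_size fits in cache.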

At that size, the library versions of memcpy and memmove are about 50%
faster than the version that gcc compiles directly into fio
(7000 MiB/s vs. 4700 MiB/s). Surprisingly, the compiler does use vmovdq
with 32-byte ymm registers for both the loads and stores, like the
library; unlike the library, it doesn't use prefetch and nontemporal
stores.

Here are the instructions used for the 135 MiB buffer size:

memcpy:       prefetch, vmovdq, vmovnt (ymm registers; uses __memmove_avx_unaligned_erms)
memmove:      prefetch, vmovdq, vmovnt (ymm registers; uses __memmove_avx_unaligned_erms)
simplememcpy: vmovdq, vmovdq (ymm registers)
memcsum:      mov, add, mov
memset:       rep stos (uses __memset_avx2_erms) (36% in queued_spin_lock_slowpath)
streamcopy:   vmovap, vmovap (ymm registers)
streamscale:  vmulpd, vmovup (ymm registers)
streamadd:    vmovup, vaddpd, vmovup (ymm registers)
streamtriad:  vmovup, vfmadd, vmovup (ymm registers)

For memset, the kernel is spending a lot of time on other functions:
36% in queued_spin_lock_slowpath and a huge percentage on [other] with
just a hex address. This is with the brand new 4.15-rc2 kernel.
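To make the prefetch plus nontemporal-store pattern concrete, here is a rough sketch of a copy loop built that way. It is not glibc's __memmove_avx_unaligned_erms. It uses 16-byte SSE2 intrinsics instead of 32-byte AVX so it builds on x86-64 without extra compiler flags, and it falls back to plain memcpy elsewhere. The name nt_copy and the PREFETCH_AHEAD distance of 256 bytes are arbitrary choices for illustration, not tuned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2 loads/stores; also provides _mm_prefetch, _mm_sfence */
#endif

#define PREFETCH_AHEAD 256	/* bytes ahead of the read pointer; arbitrary */

/* Copy n bytes using unaligned loads and nontemporal (cache-bypassing) stores. */
static void *nt_copy(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
	unsigned char *d = dst;
	const unsigned char *s = src;
	int i;

	/* _mm_stream_si128 needs a 16-byte aligned destination. */
	while (n && ((uintptr_t)d & 15)) {
		*d++ = *s++;
		n--;
	}

	/* Main loop: one 64-byte cache line per iteration. */
	while (n >= 64) {
		/* Only prefetch addresses that are still inside the source. */
		if (n > PREFETCH_AHEAD)
			_mm_prefetch((const char *)s + PREFETCH_AHEAD, _MM_HINT_T0);
		for (i = 0; i < 4; i++) {
			__m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
			_mm_stream_si128((__m128i *)(d + 16 * i), v);
		}
		s += 64;
		d += 64;
		n -= 64;
	}

	/* Make the nontemporal stores globally visible before returning. */
	_mm_sfence();

	/* Remaining tail, fewer than 64 bytes. */
	while (n--)
		*d++ = *s++;
	return dst;
#else
	/* No SSE2 available: fall back to the library routine. */
	return memcpy(dst, src, n);
#endif
}
```

Nontemporal stores mainly pay off when the destination is much larger than the cache, as with the 135 MiB buffer here. For small copies that are read back soon afterwards, they can easily be slower than ordinary stores.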

Results:

$ numactl -C 0 -m 0 ./fio --memcpytest
memcpy compile-time options: BUF_SIZE=135 MiB, NR_INTERS=64
memcpy
       8 bytes:   2530.90 MiB/sec
      16 bytes:   3227.57 MiB/sec
      96 bytes:   3858.30 MiB/sec
     128 bytes:   3953.39 MiB/sec
     256 bytes:   4139.20 MiB/sec
     512 bytes:   4006.96 MiB/sec
    2048 bytes:   3889.17 MiB/sec
    8192 bytes:   3617.86 MiB/sec
  131072 bytes:   3660.72 MiB/sec
  262144 bytes:   3667.36 MiB/sec
  524288 bytes:   3671.33 MiB/sec
       135 MiB:   7027.01 MiB/sec
memmove
       8 bytes:   2535.82 MiB/sec
      16 bytes:   3220.63 MiB/sec
      96 bytes:   3875.11 MiB/sec
     128 bytes:   3953.44 MiB/sec
     256 bytes:   4147.88 MiB/sec
     512 bytes:   4000.41 MiB/sec
    2048 bytes:   3901.79 MiB/sec
    8192 bytes:   3619.32 MiB/sec
  131072 bytes:   3655.06 MiB/sec
  262144 bytes:   3695.23 MiB/sec
  524288 bytes:   3661.52 MiB/sec
       135 MiB:   7065.83 MiB/sec
simplememcpy
       8 bytes:   1812.23 MiB/sec
      16 bytes:   2123.49 MiB/sec
      96 bytes:   1736.55 MiB/sec
     128 bytes:   2215.37 MiB/sec
     256 bytes:   3883.88 MiB/sec
     512 bytes:   4572.59 MiB/sec
    2048 bytes:   4498.80 MiB/sec
    8192 bytes:   4653.40 MiB/sec
  131072 bytes:   4765.23 MiB/sec
  262144 bytes:   4767.77 MiB/sec
  524288 bytes:   4777.56 MiB/sec
       135 MiB:   4761.85 MiB/sec
memset (write-only: register-to-memory write)
       8 bytes:   2339.57 MiB/sec
      16 bytes:   3612.22 MiB/sec
      96 bytes:   4469.97 MiB/sec
     128 bytes:   4735.25 MiB/sec
     256 bytes:   4103.96 MiB/sec
     512 bytes:   3792.41 MiB/sec
    2048 bytes:   3910.08 MiB/sec
    8192 bytes:   4151.58 MiB/sec
  131072 bytes:   4731.21 MiB/sec
  262144 bytes:   4751.02 MiB/sec
  524288 bytes:   4758.88 MiB/sec
       135 MiB:   4772.74 MiB/sec
memcsum (read-only; csum += *s++, using uint64_t)
       8 bytes:   3587.10 MiB/sec
      16 bytes:   3581.16 MiB/sec
      96 bytes:   3601.63 MiB/sec
     128 bytes:   3604.49 MiB/sec
     256 bytes:   3603.76 MiB/sec
     512 bytes:   3610.62 MiB/sec
    2048 bytes:   3597.99 MiB/sec
    8192 bytes:   3595.70 MiB/sec
  131072 bytes:   3599.04 MiB/sec
  262144 bytes:   3596.95 MiB/sec
  524288 bytes:   3598.67 MiB/sec
       135 MiB:   3596.81 MiB/sec
streamcopy (*d++ = *s++)
       8 bytes:   4532.95 MiB/sec
      16 bytes:   4602.78 MiB/sec
      96 bytes:   4745.72 MiB/sec
     128 bytes:   4705.28 MiB/sec
     256 bytes:   4749.00 MiB/sec
     512 bytes:   4735.48 MiB/sec
    2048 bytes:   4759.82 MiB/sec
    8192 bytes:   4770.35 MiB/sec
  131072 bytes:   4764.58 MiB/sec
  262144 bytes:   4771.19 MiB/sec
  524288 bytes:   4763.98 MiB/sec
       135 MiB:   4764.52 MiB/sec
streamscale (*d++ = scalar * *s++)
       8 bytes:   4523.42 MiB/sec
      16 bytes:   4479.40 MiB/sec
      96 bytes:   4692.95 MiB/sec
     128 bytes:   4608.53 MiB/sec
     256 bytes:   4626.64 MiB/sec
     512 bytes:   4646.09 MiB/sec
    2048 bytes:   4662.36 MiB/sec
    8192 bytes:   4667.80 MiB/sec
  131072 bytes:   4667.98 MiB/sec
  262144 bytes:   4665.11 MiB/sec
  524288 bytes:   4671.10 MiB/sec
       135 MiB:   4671.41 MiB/sec
streamadd (two reads, one write: *d++ = *s++ + *s2++)
       8 bytes:   1909.03 MiB/sec
      16 bytes:   2686.48 MiB/sec
      96 bytes:   3034.89 MiB/sec
     128 bytes:   3226.04 MiB/sec
     256 bytes:   3175.74 MiB/sec
     512 bytes:   3190.69 MiB/sec
    2048 bytes:   3310.24 MiB/sec
    8192 bytes:   3344.29 MiB/sec
  131072 bytes:   3376.11 MiB/sec
  262144 bytes:   3388.70 MiB/sec
  524288 bytes:   3388.09 MiB/sec
       135 MiB:   3394.96 MiB/sec
streamtriad (two reads, one write: *d++ = *s++ + scalar * *s2++)
       8 bytes:   1793.84 MiB/sec
      16 bytes:   2406.70 MiB/sec
      96 bytes:   3031.67 MiB/sec
     128 bytes:   3226.22 MiB/sec
     256 bytes:   3156.21 MiB/sec
     512 bytes:   3183.24 MiB/sec
    2048 bytes:   3307.65 MiB/sec
    8192 bytes:   3336.36 MiB/sec
  131072 bytes:   3369.90 MiB/sec
  262144 bytes:   3371.26 MiB/sec
  524288 bytes:   3375.22 MiB/sec
       135 MiB:   3382.46 MiB/sec
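
For reference, the extra test functions boil down to loops like the ones below. This is a sketch reconstructed from the one-line descriptions in the results (for example `*d++ = *s++ + scalar * *s2++`), not the exact code I added locally. The function names and the choice to pass the length as an element count are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* STREAM-style copy: one read, one write per element. */
static void stream_copy(double *d, const double *s, size_t n)
{
	while (n--)
		*d++ = *s++;
}

/* STREAM-style scale: one read, one multiply, one write. */
static void stream_scale(double *d, const double *s, double scalar, size_t n)
{
	while (n--)
		*d++ = scalar * *s++;
}

/* STREAM-style add: two reads, one write. */
static void stream_add(double *d, const double *s, const double *s2, size_t n)
{
	while (n--)
		*d++ = *s++ + *s2++;
}

/* STREAM-style triad: two reads, one multiply-add, one write. */
static void stream_triad(double *d, const double *s, const double *s2,
			 double scalar, size_t n)
{
	while (n--)
		*d++ = *s++ + scalar * *s2++;
}

/* Read-only pass: sum n 64-bit words (wraps on overflow). */
static uint64_t mem_csum(const uint64_t *s, size_t n)
{
	uint64_t csum = 0;

	while (n--)
		csum += *s++;
	return csum;
}
```

Here n counts elements (doubles or uint64_t words), not bytes, so a byte-based harness would pass bytes / 8. The read-only checksum has to return or otherwise use its result, or the compiler is free to delete the loop.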