On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@xxxxxxx> wrote:
>
> Hi all,
>
> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>
>     double tmp_values[4] = {0};
>
> and
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
> The first code snippet leads to
>
>     vmovaps XMMWORD PTR [rsp], xmm0
>     vmovaps XMMWORD PTR [rsp+16], xmm0
>
> But the second leads to only
>
>     vmovapd YMMWORD PTR [rsp], ymm0
>
> which is less efficient than the previous one. Am I missing something?
>

Assuming you're working on Skylake, the latency and throughput of
vmovaps/vmovapd are:

                     | lat     | throughput  | uops | port
 VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23
 VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23

Refer to https://uops.info/table.html

So the latter (the single YMM store) seems better.

> For the full code, see this godbolt link: https://godbolt.org/z/jonf72 , and I paste the full input and output below:
>
> Input code
>
> #include <cstring>
>
> double loadu1(const void* ptr, int count) {
>
>     double tmp_values[4] = {0};
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> double loadu2(const void* ptr, int count) {
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> Output assemblies:
>
> loadu1(void const*, int):
>         sub     rsp, 40
>         movsx   rdx, esi
>         vpxor   xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rdi, rsp
>         vmovaps XMMWORD PTR [rsp], xmm0
>         vmovaps XMMWORD PTR [rsp+16], xmm0
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         add     rsp, 40
>         ret
> loadu2(void const*, int):
>         push    rbp
>         movsx   rdx, esi
>         vxorpd  xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rbp, rsp
>         and     rsp, -32
>         sub     rsp, 32
>         mov     rdi, rsp
>         vmovapd YMMWORD PTR [rsp], ymm0
>         vzeroupper
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         leave
>         ret
>
> Thanks!
> Hong
>

--
BR,
Hongtao
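
P.S. For anyone who wants to confirm that the two initializations are interchangeable at the source level, so that the difference discussed here is purely a code-generation matter, a minimal sketch follows. It reuses loadu1 and loadu2 from the quoted message; check_equivalent is a hypothetical helper added only for illustration, and it assumes the source buffer holds at least four doubles and that count stays within 0..4.

```cpp
#include <cassert>
#include <cstring>

// As in the quoted message: aggregate initialization zeroes all four slots.
double loadu1(const void* ptr, int count) {
    double tmp_values[4] = {0};
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// As in the quoted message: an explicit loop zeroes all four slots.
double loadu2(const void* ptr, int count) {
    double tmp_values[4];
    for (auto i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Hypothetical helper (not from the thread): asserts that both variants
// return the same sum for every count in 0..4.
// Assumption: src points to at least four readable doubles.
void check_equivalent(const double* src) {
    for (int count = 0; count <= 4; ++count) {
        assert(loadu1(src, count) == loadu2(src, count));
    }
}
```

Both functions perform the same additions in the same order, so an exact comparison is reasonable here; this only checks behaviour and says nothing about which store sequence is faster.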