RE: Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?

"Hong X" <hongx@xxxxxxx> · Mon, 9 Mar 2020 18:01:39 +0000

-----Hongtao Liu <crazylht@xxxxxxxxx> wrote: -----

>To: Hong X <hongx@xxxxxxx>
>From: Hongtao Liu <crazylht@xxxxxxxxx>
>Date: 03/08/2020 22:54
>Cc: gcc-help@xxxxxxxxxxx
>Subject: [EXTERNAL] Re: Initializing a vector to zero leads to less
>efficient assemblies than manually assigning a vector to zero?
>
>On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@xxxxxxx> wrote:
>>
>> Hi all,
>>
>> I tried to compile the following two code snippets with
>"--std=c++14 -mavx2 -O3" options:
>>
>>     double tmp_values[4] = {0};
>>
>> and
>>
>>     double tmp_values[4];
>>
>>     for (auto i = 0; i < 4; ++i) {
>>         tmp_values[i] = 0.0;
>>     }
>>
>> The first code snippet leads to
>>
>>     vmovaps XMMWORD PTR [rsp], xmm0
>>     vmovaps XMMWORD PTR [rsp+16], xmm0
>>
>> But the second leads to only
>>
>>     vmovapd YMMWORD PTR [rsp], ymm0
>>
>> which is less efficient than the previous one. Am I missing
>something?
>>
>Assume you're working on Skylake. The latency and throughput of
>vmovaps/vmovapd are:
>
>| Instruction         | Latency | Throughput  | uops | Ports |
>| VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
>| VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |
>
>Refer to
>https://uops.info/table.html
>So the latter seems better.

Oops, I said it the wrong way around. I meant that the second is *more* (not *less*, as in my original post) efficient than the first, even though the two are functionally equivalent. The first form is likely the one an average C++ programmer would prefer, so this looks odd to me.
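
For anyone who wants to reproduce this, here is a minimal sketch of the two variants as standalone functions. The function names and the memcpy to an output pointer are my own additions (assumptions, not part of the original snippets); the copy is there only so the compiler cannot drop the stores to tmp_values entirely. The generated code may differ between GCC versions and targets.

```cpp
#include <cstring>

// Variant 1: aggregate initialization, as in the first snippet.
// The name zero_by_initializer is hypothetical, chosen for this sketch.
void zero_by_initializer(double* out) {
    double tmp_values[4] = {0};
    // Copy out so the zeroing of tmp_values stays observable.
    std::memcpy(out, tmp_values, sizeof tmp_values);
}

// Variant 2: explicit loop, as in the second snippet.
// The name zero_by_loop is hypothetical, chosen for this sketch.
void zero_by_loop(double* out) {
    double tmp_values[4];

    for (auto i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }
    std::memcpy(out, tmp_values, sizeof tmp_values);
}

// Sanity check that both variants produce the same bytes.
bool same_result() {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {5.0, 6.0, 7.0, 8.0};
    zero_by_initializer(a);
    zero_by_loop(b);
    return std::memcmp(a, b, sizeof a) == 0;
}
```

Compiling this with the same options as above and adding -S (or pasting it into a compiler explorer) should make it easy to compare the stores emitted for each function side by side.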

Thanks,
Hong