How to make gcc vectorize identical statements?

Yifei <hlfqdhj@xxxxxxx> · Tue, 11 Apr 2017 23:48:40 +0800 (CST)

Hi everyone,
Please consider the following code, which computes the sine of four variables simultaneously:
#include <math.h>
#include <immintrin.h>

__m256d sin4(double* a) { // ideally the argument would be a __m256d as well
  for (int i = 0; i != 4; ++i)
    a[i] = sin(a[i]);
  return _mm256_loadu_pd(a);
}
When this is compiled with -O3 -ffast-math -march=native (on Haswell), gcc optimizes the code into a single call to the vectorized sine.
However, for this code:
__m256d sin4(double* a){
    double b[4];
    b[0] = sin(a[0]); // b or a, doesn't matter
    b[1] = sin(a[1]);
    b[2] = sin(a[2]);
    b[3] = sin(a[3]);
    return _mm256_loadu_pd(b);
}
gcc refuses to call the vectorized sine, even at -O3 or -Ofast. It generates four scalar sine calls instead, complaining that 'relevant stmt not supported'. (Yet icc and cl both seem to vectorize it.)
The two versions are essentially identical, and gcc seems to vectorize only loops (this is not strictly true, though: gcc's SLP vectorizer does work on simple operations). But for 128-bit SSE double vectors, writing a loop is just too cumbersome.

I'm wondering why gcc fails to perform such a straightforward optimization. And what can I do, as a workaround, to avoid writing an explicit loop?

Thanks,
Yifei