Hello, Consider the following test case. typedef unsigned int u32; u32 foo(const u32 *u, const u32 *v) { u32 t0 = u[0] + u[3] + u[6] + u[9]; u32 t1 = v[1] + v[3] + v[5] + v[7]; return t0 + t1; } AFAIU, for several years, x86 implementations have been able to issue two loads per cycle, and I expected gcc to compute t0 and t1 in parallel. But instead, it creates a single dependency chain. $ gcc-7 -march=skylake -O3 -S testcase.c foo: movl 12(%rsi), %eax addl 4(%rsi), %eax addl 20(%rsi), %eax addl 28(%rsi), %eax addl (%rdi), %eax addl 12(%rdi), %eax addl 24(%rdi), %eax addl 36(%rdi), %eax ret I don't think this code would benefit from SSE or auto-vectorization. But computing t0 and t1 in parallel might give a non-trivial speedup, especially for longer chains. What do you think? Regards.