When I enable the tree-loop-vectorize optimization I'm seeing some behavior that I don't understand. Here is a test program that highlights the scenario:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(){
        int a = 2;
        int *buffer = malloc(10000 * sizeof(*buffer));

        for (int i = 0; i < 10000; i++){
            if ((i % 1000) == 0){
                a = a * 15;
            }
            buffer[i] = a;
        }

        int fd = open("output", O_CREAT | O_WRONLY, 0644); /* O_CREAT needs a mode argument */
        write(fd, buffer, 10000 * sizeof(*buffer));
        close(fd);
        return 0;
    }

When compiled with the option disabled (gcc -std=c99 -O3 -fno-tree-loop-vectorize test.c -S -masm=intel), the following code is produced for the loop:

        mov     ebx, 2              ; ebx is a
        mov     edi, 274877907
        ...
        call    malloc
        ...
        mov     esi, eax            ; esi is the buffer pointer
        xor     ecx, ecx            ; ecx is i
        ...
    .L3:
        mov     eax, ecx
        imul    edi
        mov     eax, ecx
        sar     eax, 31
        sar     edx, 6
        sub     edx, eax
        imul    edx, edx, 1000      ; Compute optimized modulus - see hackersdelight.org/magic.htm
        cmp     ecx, edx
        jne     .L2
        mov     eax, ebx
        sal     eax, 4
        sub     eax, ebx
        mov     ebx, eax            ; Compute optimized multiplication of a by 15
    .L2:
        mov     DWORD PTR [esi+ecx*4], ebx
        add     ecx, 1
        cmp     ecx, 10000
        jne     .L3

When the tree-loop-vectorize option is enabled (gcc -std=c99 -O3 -ftree-loop-vectorize test.c -S -masm=intel), though, the following code is generated:

        mov     ebx, 2
        ...
        call    malloc
        ...
        mov     edi, eax
        xor     ecx, ecx
        ...
    .L2:
        mov     eax, 274877907
        mov     esi, ebx
        imul    ecx
        mov     eax, ecx
        sal     esi, 4
        sar     eax, 31
        sub     esi, ebx
        sar     edx, 6
        sub     edx, eax
        imul    edx, edx, 1000
        cmp     ecx, edx
        cmove   ebx, esi
        mov     DWORD PTR [edi+ecx*4], ebx
        add     ecx, 1
        cmp     ecx, 10000
        jne     .L2

The main difference is that the 'a * 15' calculation is now done on every loop iteration instead of every 1000th; a only receives the new value every 1000th time, via the conditional move instruction. This example isn't too crazy, but performance seems to worsen if you do more complex operations inside the if block (try 'a = a*a*a*a*a;') and/or if you switch to 64-bit data types (try changing 'a' and 'buffer' to use long long instead of int - the conditional move gets replaced with two moves after the modulus-check branch, and more instructions are needed to handle the larger-than-register arithmetic).

Anyway, my question is why the optimization does this. Why not keep the modulus check at the top and bypass the calculations on each loop iteration? Getting rid of the branch might be worthwhile in the case above, but in the 64-bit case that benefit goes away. This example program is a bit contrived, but I saw this behavior in "real" code and was trying to understand it. If anyone could share some insight into this I'd greatly appreciate it!

-Andrew

One last thing to note - the assembly above was generated using gcc version 4.9.2 20141224 (prerelease), the one that Arch Linux currently has in its software repo.
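P.S. To check that I'm reading the vectorized version correctly, here is a rough C-level sketch of what the if-converted loop body seems to be doing. This is just my interpretation of the assembly above (not anything GCC actually emits), and the variable names are my own:

    /* Sketch of the loop after if-conversion, as I read the assembly.
       Both candidate values of 'a' are computed unconditionally every
       iteration; the modulus test only selects which one is kept. */
    for (int i = 0; i < 10000; i++){
        int a_times_15  = (a << 4) - a;          /* a * 15, computed every iteration  */
        int rem_is_zero = ((i % 1000) == 0);     /* magic-number division in the asm  */
        a = rem_is_zero ? a_times_15 : a;        /* this is the cmove instruction     */
        buffer[i] = a;
    }

If that reading is right, the work inside the old if block has simply been hoisted into straight-line code so the body has no branch left in it.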