Ulrich Drepper wrote:
> Dominik 'Rathann' Mierzejewski wrote:
>> I'd like to see a case (not involving Pentium 4) where using cmov is slower
>> than not using it. It definitely is faster for decoding H.264 in FFmpeg
>> for example.
>
> I don't have a specific test case. But I do talk to the CPU
> architectures at Intel regularly. They always say the cmov should be
> avoided. Especially with the introduction of the fused micro-ops the
> various cmp+jcc pairs are likely move faster.

Always demand measurements. See below for seven different chips
which span a decade of implementation. Cmov is faster when the jxx
branch predictor would fail [Pentium4 NetBurst can be an exception],
and cmov wins by a very large margin on CoreDuo and Core2Duo.

> And from the code generation perspective using cmp+jcc is also more
> flexible. With cmov you have to tie up two registers. This is
> particularly bad with the x86 ABI.

The frequent case of computing minimum or maximum requires only one register:
        mov m(%ebp),%eax
        cmp n(%ebp),%eax
        cmova n(%ebp),%eax

> There are certainly cases where cmov can be faster. Perhaps exclusively
> on older micro architectures (P4s, early Core2, maybe AMD, haven't
> checked). But in general it's no win.

Please give measurements. Mine show that the newer the chip,
the more cmov wins when the jxx branch predictor would fail.
[Core i7 untested.]

-----
User CPU time in seconds (smaller is better.)
"for i in 1 2 3 4 5; do time ./XXXXX; done"
[dual processor often reflects alternating core assignment!]

 cmov2   cmp-jmp2   CPU

Family 6 Model 23 (Core2 Duo E8400; 3000MHz)
 2.873    6.096
 2.873    6.029
 2.868    6.135
 2.875    6.038
 2.868    6.079

Family 15 Model 107 (Athlon64x2 4800+; 2500MHz)
 3.182    4.433
 3.529    4.433
 3.184    4.432
 3.543    4.437
 3.182    4.428

Family 15 Model 47 (Athlon64 3200+; 2000MHz)
 3.914    5.530
 3.913    5.529
 3.913    5.532
 3.911    5.533
 3.915    5.530

Family 6 Model 14 (CoreDuo 1300 [not Core2]; 1666MHz)
 4.746   10.638
 4.716   10.658
 4.723   10.630
 4.705   10.666
 4.705   10.657

Family 15 Model 2 (Pentium4 Northwood; 1600MHz)
12.081   11.129
12.089   11.137
12.081   11.133
12.081   11.225
12.081   11.165

Family 6 Model 7 (AMD Duron 1200MHz)
11.894   13.370
11.939   13.322
11.912   13.358
11.814   13.320
11.913   13.379

Family 6 Model 8 (PentiumIII Coppermine; 700MHz)
16.300   16.383
16.058   16.061
16.054   16.054
16.058   16.055
16.052   16.052
-----

----- cmov2.S; gcc -o cmov2 -nostartfiles -nostdlib cmov2.S
        .balign 64
sub1:                                   /* unsigned min of the two stack slots */
        mov -4(%ebp),%eax
        cmp -8(%ebp),%eax
        cmova -8(%ebp),%eax             /* if %eax is above (unsigned), take the other value */
        ret

_start: .globl _start
        nop
        and $~0<<6,%esp                 /* align the stack to 64 bytes */
        mov %esp,%ebp
        sub $4*4,%esp

        mov $0x10000000 -1,%ecx         /* loop counter: 0x10000000 iterations */
        mov $1,%esi
        mov $2,%edi
        jmp top

        .balign 64
top:                                    /* alternate which slot holds the smaller value */
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
        sub $1,%ecx; jnc top            /* loop until the counter borrows */

        sub %ebx,%ebx                   /* exit status 0 */
        mov $1,%eax                     /* sys_exit */
        int $0x80
/* EOF */
-----

----- cmp-jmp2.S; gcc -o cmp-jmp2 -nostartfiles -nostdlib cmp-jmp2.S
        .balign 64
sub1:                                   /* unsigned min of the two stack slots */
        mov -4(%ebp),%eax
        cmp -8(%ebp),%eax; jbe 0f       /* skip the reload if %eax is already below or equal */
        mov -8(%ebp),%eax
0:
        ret

_start: .globl _start
        nop
        and $~0<<6,%esp                 /* align the stack to 64 bytes */
        mov %esp,%ebp
        sub $4*4,%esp

        mov $0x10000000 -1,%ecx         /* loop counter: 0x10000000 iterations */
        mov $1,%esi
        mov $2,%edi
        jmp top

        .balign 64
top:                                    /* alternate which slot holds the smaller value */
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
        sub $1,%ecx; jnc top            /* loop until the counter borrows */

        sub %ebx,%ebx                   /* exit status 0 */
        mov $1,%eax                     /* sys_exit */
        int $0x80
/* EOF */
-----

-- 
John Reiser, jreiser@xxxxxxxxxxxx
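
To put one number on each chip, the five runs per chip in the table above can be averaged and the cmp-jmp2/cmov2 ratio taken. What follows is only a minimal sketch in Python: the script and the names TIMES and mean_ratio are illustrative and not part of the original post. Only the timings are copied from the table.

```python
# Timings (user CPU seconds) copied from the table above.
# Each entry is one run: (cmov2, cmp-jmp2).
# TIMES and mean_ratio are illustrative names, not from the original post.
TIMES = {
    "Core2 Duo E8400": [(2.873, 6.096), (2.873, 6.029), (2.868, 6.135),
                        (2.875, 6.038), (2.868, 6.079)],
    "Athlon64x2 4800+": [(3.182, 4.433), (3.529, 4.433), (3.184, 4.432),
                         (3.543, 4.437), (3.182, 4.428)],
    "Athlon64 3200+": [(3.914, 5.530), (3.913, 5.529), (3.913, 5.532),
                       (3.911, 5.533), (3.915, 5.530)],
    "CoreDuo 1300": [(4.746, 10.638), (4.716, 10.658), (4.723, 10.630),
                     (4.705, 10.666), (4.705, 10.657)],
    "Pentium4 Northwood": [(12.081, 11.129), (12.089, 11.137), (12.081, 11.133),
                           (12.081, 11.225), (12.081, 11.165)],
    "AMD Duron": [(11.894, 13.370), (11.939, 13.322), (11.912, 13.358),
                  (11.814, 13.320), (11.913, 13.379)],
    "PentiumIII Coppermine": [(16.300, 16.383), (16.058, 16.061), (16.054, 16.054),
                              (16.058, 16.055), (16.052, 16.052)],
}


def mean_ratio(runs):
    """Return mean(cmp-jmp2) / mean(cmov2) over the given runs."""
    cmov = sum(c for c, _ in runs) / len(runs)
    jmp = sum(j for _, j in runs) / len(runs)
    return jmp / cmov


if __name__ == "__main__":
    for cpu, runs in TIMES.items():
        print(f"{cpu:24s} cmp-jmp2/cmov2 = {mean_ratio(runs):.2f}")
```

A ratio above 1 means cmov2 took less time on that chip; a ratio below 1 means cmp-jmp2 did.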

fedora-devel-list mailing list
fedora-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/fedora-devel-list