cmov vs cmp+jxx across x86 CPU implementations

Wed Feb 4 17:13:58 UTC 2009

Ulrich Drepper wrote:
> Dominik 'Rathann' Mierzejewski wrote:
>> I'd like to see a case (not involving Pentium 4) where using cmov is slower
>> than not using it. It definitely is faster for decoding H.264 in FFmpeg
>> for example.
> 
> I don't have a specific test case.  But I do talk to the CPU
> architectures at Intel regularly.  They always say the cmov should be
> avoided.  Especially with the introduction of the fused micro-ops the
> various cmp+jcc pairs are likely move faster.

Always demand measurements.  See below for seven different chips which
span a decade of implementations.  Cmov is faster when the jxx branch
predictor would fail [Pentium4 NetBurst can be an exception: here cmov is
about 8% slower; PentiumIII is a tie], and cmov wins by more than a factor
of two on CoreDuo and Core2Duo.

> And from the code generation perspective using cmp+jcc is also more
> flexible.  With cmov you have to tie up two registers.  This is
> particularly bad with the x86 ABI.

The frequent case of computing minimum or maximum requires only one register:
	mov   m(%ebp),%eax
	cmp   n(%ebp),%eax
	cmova n(%ebp),%eax

> There are certainly cases where cmov can be faster.  Perhaps exclusively
> on older micro architectures (P4s, early Core2, maybe AMD, haven't
> checked).  But in general it's no win.

Please give measurements.  Mine show that, Pentium4 aside, the newer the
chip, the more cmov wins when the jxx branch predictor would fail.
[Core i7 untested.]

-----
User CPU time in seconds (smaller is better), five runs of each program,
where XXXXX is cmov2 or cmp-jmp2:
"for i in 1 2 3 4 5; do time ./XXXXX; done"
[On dual-processor machines, variation between runs often reflects
alternating core assignment!]

cmov2	cmp-jmp2     CPU
		Family 6 Model 23 (Core2 Duo E8400; 3000MHz)
2.873	6.096
2.873	6.029
2.868	6.135
2.875	6.038
2.868	6.079

		Family 15 Model 107 (Athlon64x2 4800+; 2500MHz)
3.182	4.433
3.529	4.433
3.184	4.432
3.543	4.437
3.182	4.428

		Family 15 Model 47 (Athlon64 3200+; 2000MHz)
3.914	5.530
3.913	5.529
3.913	5.532
3.911	5.533
3.915	5.530

		Family 6 Model 14 (CoreDuo 1300 [not Core2]; 1666MHz)
4.746	10.638
4.716	10.658
4.723	10.630
4.705	10.666
4.705	10.657

		Family 15 Model 2 (Pentium4 Northwood; 1600MHz)
12.081	11.129
12.089	11.137
12.081	11.133
12.081	11.225
12.081	11.165

		Family 6 Model 7 (AMD Duron; 1200MHz)
11.894	13.370
11.939	13.322
11.912	13.358
11.814	13.320
11.913	13.379

		Family 6 Model 8 (PentiumIII Coppermine; 700MHz)
16.300	16.383
16.058	16.061
16.054	16.054
16.058	16.055
16.052	16.052
-----

----- cmov2.S;  gcc -o cmov2 -nostartfiles -nostdlib cmov2.S
	.balign 64
sub1:
	mov   -4(%ebp),%eax	/* eax = a */
	cmp   -8(%ebp),%eax	/* flags from a - b */
	cmova -8(%ebp),%eax	/* if a > b (unsigned) then eax = b; no branch */
	ret

_start: .globl _start
	nop
	and $~0<<6,%esp
	mov %esp,%ebp
	sub $4*4,%esp
	mov $0x10000000 -1,%ecx
	mov $1,%esi
	mov $2,%edi
	jmp top

	.balign 64
top:
	mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
	mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
	mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
	mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
	sub $1,%ecx; jnc top

	sub %ebx,%ebx
	mov $1,%eax
	int $0x80
/* EOF */
-----

----- cmp-jmp2.S;  gcc -o cmp-jmp2 -nostartfiles -nostdlib cmp-jmp2.S
	.balign 64
sub1:
	mov -4(%ebp),%eax		/* eax = a */
	cmp -8(%ebp),%eax; jbe 0f	/* a <= b (unsigned): keep a, skip the reload */
	mov -8(%ebp),%eax		/* a > b: eax = b */
0:
	ret

_start: .globl _start
	nop
	and $~0<<6,%esp
	mov %esp,%ebp
	sub $4*4,%esp
	mov $0x10000000 -1,%ecx
	mov $1,%esi
	mov $2,%edi
	jmp top

	.balign 64
top:
	mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
	mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
	mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
	mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
	sub $1,%ecx; jnc top

	sub %ebx,%ebx
	mov $1,%eax
	int $0x80
/* EOF */
-----

-- 
John Reiser, jreiser at BitWagon.com