Re: Question about the use of C99/gcc built-in math intrinsics within GEGL on gfloats

Nicolas Robidoux <nrobidoux@xxxxxxxxxxxxxxxx> · Fri, 12 Sep 2008 23:36:57 -0400 (EDT)

I just completed a quick-and-dirty benchmark comparing arithmetic
branching with C99/gcc intrinsics, within the yafr sampler code, to
the standard C if/then/else.

These tests were performed on a Thinkpad t60p with Intel(R) Core(TM)2
CPU T7200 @ 2.00GHz with 2025MiB memory running 2.6.24-19-generic #1
SMP by way of a pretty standard Ubuntu 8.04.

Warning: There seems to be something wrong with math.h with the
current version of gcc, as suggested by some recent bug postings. For
example, according to the gcc documentation, I should not have to
prefix fminf with __builtin_. Consequently, it could be that the
benchmark results will soon be made irrelevant.

Second warning: If my memory serves, Intel chips have a good and fast
implementation of the "? :" branching construct (having to do with
selecting which register to copy into another), as well as good branch
prediction. My code without intrinsics is structured to take advantage
of this.

Third warning: I have not optimized by looking at the assembler output
of gcc, and have done no optimization of the "arithmetic branching"
version of the code. In particular, I have not used fmaf, even though
my code is peppered with opportunities to use it (this may not be a
big deal: apparently, gcc attempts to spot opportunities to use fused
multiply-add).

------------------------------
quick description of the test:
------------------------------

I ran a bunch of consecutive scalings (times 20) of a digital
photograph with initial dimensions 200x133, driving the gegl scale
through an xml file analogous to the ones in gegl/docs/gallery,
alternating between the "with branching" and "arithmetic branching
with intrinsics" versions, and throwing in four scalings with the gegl
stock linear.

-------------------------------------------------
Differences between the two versions of the code:
-------------------------------------------------

16 code segments resembling the following (note the ?: this is the
version with branching):

  const gfloat prem_squared = prem * prem;
  const gfloat deux_squared = deux * deux;
  const gfloat troi_squared = troi * troi;
  const gfloat prem_times_deux = prem * deux;
  const gfloat deux_times_troi = deux * troi;
  const gfloat deux_squared_minus_prem_squared = deux_squared - prem_squared;
  const gfloat troi_squared_minus_deux_squared = troi_squared - deux_squared;
  const gfloat prem_vs_deux =
    deux_squared_minus_prem_squared > (gfloat) 0. ? prem : deux;
  const gfloat deux_vs_troi =
    troi_squared_minus_deux_squared > (gfloat) 0. ? deux : troi;
  const gfloat my__up =
    prem_times_deux > (gfloat) 0. ? prem_vs_deux : (gfloat) 0.;
  const gfloat my_dow =
    deux_times_troi > (gfloat) 0. ? deux_vs_troi : (gfloat) 0.;

were replaced by (this is the version with arithmetic branching):

  const gfloat abs_prem = fabsf( prem );
  const gfloat abs_deux = fabsf( deux );
  const gfloat abs_troi = fabsf( troi );
  const gfloat prem_vs_deux = __builtin_fminf( abs_prem, abs_deux );
  const gfloat deux_vs_troi = __builtin_fminf( abs_deux, abs_troi );
  /* copysignf( x, y ) returns the magnitude of x with the sign of y,
     so the unit magnitude goes first. */
  const gfloat sign_prem = copysignf( (gfloat) 1., prem );
  const gfloat sign_deux = copysignf( (gfloat) 1., deux );
  const gfloat sign_troi = copysignf( (gfloat) 1., troi );
  /* The factor in parentheses is 2 when the signs match, 0 otherwise. */
  const gfloat my__up =
    ( sign_prem * sign_deux + (gfloat) 1. ) * prem_vs_deux;
  const gfloat my_dow =
    ( sign_deux * sign_troi + (gfloat) 1. ) * deux_vs_troi;

Basically, what the code snippets do is this:

If prem and deux have the same sign, put the smaller one (in absolute
value) in my__up. Otherwise, set my__up to zero. Do likewise with
deux, troi and my_dow. The above two code snippets represent the best
ways of performing this that I could come up with.

===================
Overall conclusion:
===================

Arithmetic branching (without other improvements) does not appear to
be worth the trouble.

================
Average timings:
================

stock gegl linear scale:

47.50 = ( 47.474 + 47.581 + 47.345 + 47.595 ) / 4

gegl yafr with ? branching and no use of intrinsics:

52.58 = 
( 52.422 + 52.479 + 52.748 + 52.501 + 52.680 + 52.623 + 52.537 +
52.518 + 52.576 + 52.487 + 52.542 + 52.485 + 52.645 + 52.810 + 52.667
+ 52.554 ) / 16

gegl yafr performing arithmetic branching with fabsf, copysignf and fminf:

52.70 = ( 52.568 + 52.447 + 52.763 + 52.524 + 52.772 + 52.652 + 52.524
+ 52.765 + 52.596 + 52.850 + 52.733 + 52.799 + 52.627 + 52.897 +
52.871 + 52.866 ) / 16

As you can see, the "?" version is slightly faster overall (52.58
versus 52.70 on average, a difference of about 0.2%). The difference
is probably not significant, but the results certainly do not suggest
that arithmetic branching is worth the hassle.
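
For anyone who wants to recheck the averages, here is a small sketch
that recomputes them from the timings listed above and derives the
relative difference between the two yafr versions. The array and
function names are hypothetical; only the numbers come from this post.

```c
#include <stddef.h>

/* Timings copied from the lists above. */
static const double branch_times[16] = {
  52.422, 52.479, 52.748, 52.501, 52.680, 52.623, 52.537, 52.518,
  52.576, 52.487, 52.542, 52.485, 52.645, 52.810, 52.667, 52.554
};
static const double arith_times[16] = {
  52.568, 52.447, 52.763, 52.524, 52.772, 52.652, 52.524, 52.765,
  52.596, 52.850, 52.733, 52.799, 52.627, 52.897, 52.871, 52.866
};

/* Arithmetic mean of n timings. */
static double mean (const double *t, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += t[i];
  return sum / (double) n;
}

/* Relative slowdown of the arithmetic version versus the "?" version,
 * as a fraction of the "?" version's mean. */
static double relative_slowdown (void)
{
  const double b = mean (branch_times, 16);
  const double a = mean (arith_times, 16);
  return (a - b) / b;
}
```

A mean alone says nothing about run-to-run noise, so this only
quantifies the gap; it does not establish whether the gap is
statistically significant.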

Nicolas Robidoux
Laurentian University/Universite Laurentienne

_______________________________________________
Gegl-developer mailing list
Gegl-developer@xxxxxxxxxxxxxxxxxxxxxx
https://lists.XCF.Berkeley.EDU/mailman/listinfo/gegl-developer