Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.


 



Hi,

Thanks for the reply.

The data I presented was collected on a Cortex-A57 CPU.

Since SPEC2017 checks results and produces a test report that flags miscomputed cases, I suppose the performance improvement is valid.

Your point that on some modern CPUs fdiv is faster than the reciprocal approximation is a new aspect I hadn't come across.

Nevertheless, on a CPU where the reciprocal approximation is profitable, as in my case, may I ask why the number of Newton iterations is fixed at 2 and 3?

And do you think it would be worth providing a parameter to change the iteration count, so that accuracy can be traded off against speed?
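
To make the iteration-count trade-off concrete, here is a minimal portable sketch in C. It is not GCC's implementation: the names recip_estimate and approx_div are hypothetical, and the linear initial estimate only stands in for the hardware reciprocal-estimate instruction. It assumes d is finite and nonzero.

```c
#include <math.h>

/* Hypothetical stand-in for a hardware reciprocal estimate.
   Splits d into m * 2^e with 0.5 <= |m| < 1, then applies the linear
   approximation 1/m ~= 48/17 - (32/17) * m (relative error <= 1/17).
   Assumes d is finite and nonzero. */
float recip_estimate(float d)
{
    int e;
    float m = frexpf(d, &e);
    float x = copysignf(48.0f / 17.0f, m) - (32.0f / 17.0f) * m;
    return ldexpf(x, -e);               /* undo the scaling: 1/d = (1/m) * 2^-e */
}

/* Approximate n / d using `iters` Newton-Raphson refinement steps.
   `iters` is the knob the question is about (hypothetical parameter). */
float approx_div(float n, float d, int iters)
{
    float x = recip_estimate(d);
    for (int i = 0; i < iters; i++)
        x = x * (2.0f - d * x);         /* relative error is roughly squared per step */
    return n * x;
}
```

With this starting estimate, approx_div(1.0f, 3.0f, 1) is within about 0.35% of 1/3, and each extra step roughly doubles the number of correct bits until float rounding dominates. Fewer steps mean fewer dependent multiplies, which is the speed side of the trade-off.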

By the way, the original data is as follows.


Test case          | Improvement
603.bwaves_s       | 7.92%
607.cactuBSSN_s    | Output miscompare
619.lbm_s          | 32.34%
621.wrf_s          | Output miscompare
627.cam4_s         | Output miscompare
628.pop2_s         | Output miscompare
638.imagick_s      | -0.97%
644.nab_s          | 9.09%
649.fotonik3d_s    | Output miscompare
654.roms_s         | -3.45%


------------------ Original ------------------
From: "Wilco Dijkstra" <Wilco.Dijkstra@xxxxxxx>
Date: Mon, Feb 24, 2020 08:59 PM
To: "gcc-help@xxxxxxxxxxx" <gcc-help@xxxxxxxxxxx>; "Bu Le" <cityubule@xxxxxx>
Subject: Re: [AArch64][Spec2017]Question about mlow-precision-div optimization.



Hi,

> I found that the mlow-precision-div option has a fixed number of Newton iterations,
> which is 2 for float type and 3 for double type.
>
> I noticed that if I alter the number of Newton iterations as follows, it could lead
> to faster performance in the SPEC2017 fpspeed test on AArch64, with less but
> acceptable precision.

Which CPU did you try this on? Those results look suspicious - lbm hardly does any
divisions for example, so either the computation has gone wrong due to the lower
accuracy or your CPU has a really slow divide...

On modern cores it is faster to do a division than to use the division approximation
instructions. Eg. on Neoverse N1 a float division takes at most 10 cycles while the
reduced approximation takes 13 cycles (and needs 3 extra instructions which take up
decode and issue slots).

Cheers,
Wilco



