Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

Nicolas Pitre <nico@xxxxxxxxxxx> · Wed, 25 Nov 2015 19:44:50 -0500 (EST)

On Thu, 26 Nov 2015, Måns Rullgård wrote:

> Nicolas Pitre <nico@xxxxxxxxxxx> writes:
> 
> > On Wed, 25 Nov 2015, Stephen Boyd wrote:
> >
> >> The ARM compiler inserts calls to __aeabi_uidiv() and
> >> __aeabi_idiv() when it needs to perform division on signed and
> >> unsigned integers. If a processor has support for the udiv and
> >> sdiv division instructions the calls to these support routines
> >> can be replaced with those instructions. Now that recordmcount
> >> records the locations of calls to these library functions in
> >> two sections (one for udiv and one for sdiv), iterate over these
> >> sections early at boot and patch the call sites with the
> >> appropriate division instruction when we determine that the
> >> processor supports the division instructions. Using the division
> >> instructions should be faster and less power intensive than
> >> running the support code.
> >
> > A few remarks:
> >
> > 1) The code assumes unconditional branches to __aeabi_idiv and 
> >    __aeabi_uidiv. What if there are conditional branches? Also, tail 
> >    call optimizations will generate a straight b opcode rather than a bl 
> >    and patching those will obviously have catastrophic results.  I think 
> >    you should validate the presence of a bl before patching over it.
> 
> I did a quick check on a compiled kernel I had nearby, and there are no
> conditional or tail calls to those functions, so although they should
> obviously be checked for correctness, performance is unlikely to matter
> for those.

I'm more worried about correctness than performance.  This kind of 
patching should be done defensively.

> However, there are almost half as many calls to __aeabi_{u}idivmod as to
> the plain div functions, 129 vs 228 for signed and unsigned combined.
> For best results, these functions should also be patched with the
> hardware instructions.  Obviously the call sites for these can't be
> patched.

Current __aeabi_{u}idivmod implementations are simple wrappers around a 
call to __aeabi_{u}idiv so they'd get the benefit automatically 
regardless of the chosen approach.

> > 2) For those cases where a call to __aeabi_uidiv and __aeabi_idiv is not 
> >    patched due to (1), you could patch a uidiv/idiv plus "bx lr" 
> >    at those function entry points too.
> >
> > 3) In fact I was wondering if the overhead of the branch and back is 
> >    really significant compared to the non trivial cost of a idiv 
> >    instruction and all the complex infrastructure required to patch 
> >    those branches directly, and consequently if the performance 
> >    difference is actually worth it versus simply doing (2) alone.
> 
> Depending on the operands, the div instruction can take as few as 3
> cycles on a Cortex-A7.

Even the current software based implementation can produce a result with 
about 5 simple ALU instructions depending on the operands.

The average cycle count is more important than the easy-way-out case. 
And then how significant the two branches around it are compared to idiv 
alone from direct patching of every call to it.

Nicolas