Re: GCC 10 using floating-point registers to pass some 64-bit arguments on ARM Cortex-M

Freddie Chopin <freddie_chopin@xxxxx> · Tue, 19 May 2020 22:28:45 +0200

On Tue, 2020-05-19 at 14:52 +0100, Richard Earnshaw wrote:
> Only d7?  No, that couldn't be right.  d7 would only be used if d0-d6
> had also been used.

I've looked at the disassembly again and my first description of the
symptoms was indeed wrong, or at least partially wrong (; Looking at the
bigger picture, it seems that the parameters are not passed via FPU
registers. Instead, FPU registers are used as intermediate helper
registers in a few places ("vldr" appears in the listing 24 times, even
though this application does not use any floating-point types or
functions).

The most common pattern is something like this:

-- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --

080112e0 <distortos::SoftwareTimerCommon::start(std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)>:
{
 80112e0:	b500      	push	{lr}
 80112e2:	b083      	sub	sp, #12
 80112e4:	ed9d 7b04 	vldr	d7, [sp, #16]
	softwareTimerControlBlock_.start(internal::getScheduler().getSoftwareTimerSupervisor(), timePoint, period);
 80112e8:	4904      	ldr	r1, [pc, #16]	; (80112fc <distortos::SoftwareTimerCommon::start(std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)+0x1c>)
 80112ea:	ed8d 7b00 	vstr	d7, [sp]
 80112ee:	3008      	adds	r0, #8
 80112f0:	f000 f840 	bl	8011374 <distortos::internal::SoftwareTimerControlBlock::start(distortos::internal::SoftwareTimerSupervisor&, std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)>
}
 80112f4:	2000      	movs	r0, #0
 80112f6:	b003      	add	sp, #12
 80112f8:	f85d fb04 	ldr.w	pc, [sp], #4
 80112fc:	20000a3c 	.word	0x20000a3c

-- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --

So it's a vldr followed by a vstr (sometimes more than one), which looks
like a way to copy a 64-bit value with a single load and a single store.
This pattern appears in the code several times. It mostly uses d7, but
sometimes d8 or d6 (some parts use two registers, d6 and d7, in the same
block of code).
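
For reference, the source shape behind the first listing is roughly the
following. This is only a sketch, not a verified testcase: all names are
hypothetical stand-ins for the distortos classes, plain 64-bit integers
replace the std::chrono types, and it has not been checked that this
reduction reproduces the vldr/vstr pair.

```cpp
#include <cstdint>

// Hypothetical stand-ins; none of these names come from distortos.
struct Supervisor
{
    std::int64_t timePoint;
    std::int64_t period;
};

Supervisor supervisor {};   // global, so its address is a link-time constant

struct ControlBlock
{
    // noinline keeps the call (and its outgoing stack argument) visible at -O2
    __attribute__((noinline))
    void start(Supervisor& s, std::int64_t timePoint, std::int64_t period)
    {
        s.timePoint = timePoint;
        s.period = period;
    }
};

struct TimerCommon
{
    // Incoming (as the listing suggests): this in r0, timePoint in r2:r3,
    // period on the stack.
    __attribute__((noinline))
    int start(std::int64_t timePoint, std::int64_t period)
    {
        // Outgoing: object in r0, s in r1, timePoint still in r2:r3, period
        // on the stack again. The second 64-bit argument therefore has to be
        // copied from the incoming stack slot to the outgoing one, which is
        // the copy that shows up as vldr/vstr in the listing.
        controlBlock_.start(supervisor, timePoint, period);
        return 0;
    }

    ControlBlock controlBlock_;
};
```

Something like "arm-none-eabi-g++ -O2 -mthumb -mcpu=cortex-m4
-mfloat-abi=hard -mfpu=fpv4-sp-d16 -S" could be used to inspect the
output. These flags are an assumption, not the exact ones from my build.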

A few times the compiler uses s16 as a scratch register, like this:

-- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --

 80060fe:	f812 3b01 	ldrb.w	r3, [r2], #1
 8006102:	9204      	str	r2, [sp, #16]
 8006104:	ee08 3a10 	vmov	s16, r3
			const auto rawQueueWrapper = makeRawQueueWrapper<0>(dynamic, fifo);
 8006108:	f816 2b01 	ldrb.w	r2, [r6], #1
 800610c:	ee18 1a10 	vmov	r1, s16
 8006110:	a809      	add	r0, sp, #36	; 0x24
 8006112:	f7ff ff8b 	bl	800602c <std::unique_ptr<distortos::test::RawQueueWrapper, std::default_delete<distortos::test::RawQueueWrapper> >

-- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --

In this case it seems to make no sense at all. Why not just move from
r3 to r1 and be done with it (s16 is not used again in this function),
or load into r1 directly?

Sorry for the initial confusion, I hope that this time I'm more precise
(;

> No, those changes are for handling of 64-bit integral values where we
> no-longer use Neon to perform those options and have improved the way
> code is generated to handle them using the GP registers.

I see. I'm just looking for the answer to my basic question - is this a
bug or a feature? If it's a feature, then maybe there's a way to
disable it somehow.

> Testcase needed.

I could try providing one if you really think that what I see here is a
bug and not expected behaviour.

Regards,
FCh