Re: [PATCH v2] ARM: avoid Cortex-A9 livelock on tight dmb loops

Russell King - ARM Linux admin <linux@xxxxxxxxxxxxxxx> · Thu, 31 Jan 2019 22:58:36 +0000

On Thu, Jan 31, 2019 at 01:58:05PM +0000, Will Deacon wrote:
> Hi Russell, Paul,
> 
> On Sun, Jan 27, 2019 at 03:28:50PM +0000, Russell King - ARM Linux admin wrote:
> > On Sun, Jan 27, 2019 at 01:15:31AM +0000, Paul Walmsley wrote:
> > > On Sat, 26 Jan 2019, Russell King - ARM Linux admin wrote:
> > > > On Sat, Jan 26, 2019 at 09:00:03PM +0000, Paul Walmsley wrote:
> > > > > There was some concern in the past that WFE, like WFI, might cause the 
> > > > > core to assert an external signal that might cause the SoC integration to 
> > > > > place the core into a low-power mode from which it might not be able to 
> > > > > wake up.  This could happen on OMAP, for example, with WFI.
> > > > > 
> > > > > I don't recall the outcome of those discussions.  Was a conclusion ever 
> > > > > reached?
> > > > 
> > > > First, we use WFE in spinlocks.  If WFE were to place the CPU in a
> > > > low power state that it may not be able to wake up from, all our
> > > > spinlocks would be unsafe.
> > > 
> > > Good point.  WFE must not assert the external signal that indicates 
> > > that the core is inactive.
> > > 
> > > > Next, in all of the situations in this patch, we're executing an
> > > > infinite loop.  If it were to cause the core to go into a low power
> > > > mode, surely that's a good thing, rather than the core endlessly
> > > > executing NOPs?  The only way out of that is for the core to receive
> > > > a reset _anyway_.
> > > 
> > > Makes sense.  
> > > 
> > > Do you recall what Will's reasoning was for preferring 10 NOPs to a WFE?
> > 
> > I think there may be an erratum for this which specifies 10 NOPs as
> > its workaround, but I don't have its number.
> 
> The erratum hits because cpu_relax() is a DMB instruction due to erratum
> 754327. That then triggers erratum 794072 because a tight loop of DMB
> instructions can cause a denial of service. One of the conditions for that
> to occur is:
> 
>   | * No more than 10 instructions other than the DMB are executed between
>   |   each DMB
> 
> Digging up the workaround:
> 
>   |  This erratum can be worked round by setting bit[4] of the undocumented
>   |  Diagnostic Control Register to 1. This register is encoded as
>   |  CP15 c15 0 c0 1. This bit can be written in Secure state only, with the
>   |  following Read/Modify/Write code sequence:
>   |
>   |	MRC p15,0,rt,c15,c0,1
>   |	ORR rt,rt,#0x10
>   |	MCR p15,0,rt,c15,c0,1
>   |
>   |  When it is set, this bit causes the DMB instruction to be decoded and
>   |  executed like a DSB. Using this software workaround is not expected to
>   |  have any impact on the overall performance of the processor on a typical
>   |  code base.
>   |
>   |  Other workarounds are also available for this erratum, to either prevent
>   |  or interrupt the continuous stream of DMB instructions that causes the
>   |  deadlock.
>   |
>   |  For example:
>   |	* Inserting a non-conditional Load or Store instruction in the loop
>   |	  between each DMB
>   |	* Inserting additional instructions in the loop, such as NOPs, to
>   |       avoid the processor seeing back to back DMB instructions.
>   |	* Making the processor executing the short loop take regular
>   |	  interrupts.
> 
> So the reason I prefer the NOPs is because that's guaranteed by the h/w folks
> to do the trick, whereas they say nothing about WFE. It should be dead easy to
> use NOPs instead, so I'm not sure why we're not just following the workaround
> here. We could even use NOPs + WFE if you like!

Okay, let's start off at the beginning:

machine_crash_nonpanic_core() does this:

	while (1)
		cpu_relax();

because the kernel has crashed, and we have no known safe way to deal
with the CPU.  So, we place the CPU into an infinite loop which we
expect it to _never_ exit - at least not until the system as a whole is
reset by some method.

In the absence of 754327, this code assembles to:

	b	.

In other words, an infinite loop.  When 754327 is enabled, this
becomes:

1:	dmb
	b	1b

Now, it's been so long ago (the commit says April 2018), that I don't
remember _which_ of these triggered the problem in OMAP4 where, if a
crash is triggered, the system tries to kexec into the panic kernel,
but fails after taking the secondary CPU down - placing it into one
of these loops.

The test as working solution I came up with was to add wfe() to
these infinite loops thusly:

	while (1) {
		cpu_relax();
		wfe();
	}

which, without 754327 builds to:

1:	wfe
	b	1b

or with 754327 is enabled:

1:	dmb
	wfe
	b	1b

Adding "wfe" does two things - where we're running on bare metal, and
the processor implements "wfe", it stops us spinning endlessly in a
loop where we're never going to do any useful work.

If we hit one of these loops in a VM, it allows the CPU to be given
back to the hypervisor and rescheduled for other purposes (maybe a
different VM) rather than wasting CPU cycles inside a crashed VM.

However, in light of 794072, you decided you wanted to see 10 nops
as well - which is reasonable to cover the case where we have 754327
enabled _and_ we have a processor that doesn't implement the wfe
hint.

So, we now end up with:

1:	wfe
	b	1b

when 754327 is disabled, or:

1:	dmb
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	wfe
	b	1b

when 754327 is enabled.  We also get the dmb + 10 nop sequence
elsewhere in the kernel, in terminating loops.

IMHO, this is reasonable - it means we get the workaround for 794072
when 754327 is enabled, but still relinquish the dead processor -
either by placing it in a lower power mode when wfe is implemented as
such or by returning it to the hypervisior, or in the case where wfe
is a no-op, we use the workaround specified in 794072 to avoid the
problem.

I personally see these as two entirely orthogonal problems - the 10
nops addresses 794072, and the wfe is an optimisation that makes the
system more efficient when crashed either in terms of power consumption
or by allowing the host/other VMs to make use of the CPU.

I don't see any reason not to use kexec() inside a VM - it has the
potential to provide automated recovery from a failure of the VMs
kernel with the opportunity for saving a crashdump of the failure.
A panic() with a reboot timeout won't do that, and reading the
libvirt documentation, setting on_reboot to "preserve" won't either
(the documentation states "The preserve action for an on_reboot event
is treated as a destroy".)  Surely it has to be a good thing to
avoiding having CPUs spinning inside a VM that is doing no useful
work.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up