Re: [PATCH 1/6] dt-bindings: firmware: Add arm,errata-management

Rob Herring <robh+dt@xxxxxxxxxx> · Mon, 3 Apr 2023 10:45:51 -0500

On Fri, Mar 31, 2023 at 11:59 AM James Morse <james.morse@xxxxxxx> wrote:
>
> Hi Rob,
>
> On 31/03/2023 14:46, Rob Herring wrote:
> > On Thu, Mar 30, 2023 at 11:52 AM James Morse <james.morse@xxxxxxx> wrote:
> >> The Errata Management SMCCC interface allows firmware to advertise whether
> >> the OS is affected by an erratum, or if a higher exception level has
> >> mitigated the issue. This allows properties of the device that are not
> >> discoverable by the OS to be described. For example, some errata depend on the
> >> behaviour of the interconnect, which is not visible to the OS.
> >>
> >> Deployed devices may find it significantly harder to update EL3
> >> firmware than the device tree. Erratum workarounds typically have to
> >> fail safe, and assume the platform is affected, putting correctness
> >> above performance.
> >
> > Updating the DT is still harder than updating the kernel. A well
> > designed binding allows for errata fixes without DT changes. That
> > generally means specific compatibles up front rather than adding
> > properties to turn things on/off.
>
> I started with a per-erratum identifier, but there are 8 of them, and it's hard to know
> where to put it.

That's still requiring updating the DT to fix things.

> The CPU side is detectable from the MIDR; it's an interconnect property
> that we need to know ... but the interconnect isn't described in the DT. (but the obvious
> compatible string identifies the PMU)

But the interconnect could be described. In fact, there's a binding
for such things already. Surprisingly, it's called 'interconnects'...
Of course, there are lots of interconnects in SoCs and the one you
need may not be described ('cause it is invisible to s/w (until it's
not)). There's a binding going back to the CCI-400 in fact. So it
isn't really that interconnects aren't described, it's that they
aren't consistently described.

If you can add this errata table to the DT, then why not add a
description of the interconnect? Then it will be there for the next thing
we need the interconnect for. I imagine some of the interconnects are
already described if not explicitly in bits and pieces (i.e. clocks or
power domains for the interconnect get tossed into some other node).

> > Do we really want to encourage/endorse/support non-updatable firmware?
> > We've rejected plenty of things with 'fix your firmware'.
>
> A DT property was explicitly requested by Marc Z on the RFC:
> https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@xxxxxxxxxx/
>
> The concern is that platforms where the CPU is affected, but the issue is masked by the
> interconnect will never bother with a firmware interface. The kernel can't know this, so
> has to enable the workaround at the cost of performance.

Sure it can. Worst case, the kernel knows the exact SoC and board it
is running on because we have root level compatibles for those. It's
just a question of whether using those would scale or not. Whether
that scales or not depends on both how long the lists are and how
distributed the implementation is (e.g. PCI quirks). More on that
below.
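
To make the "match on root level compatibles" option concrete, here is a
minimal userspace sketch. The table contents, the compatible strings and
the helper name machine_erratum_masked() are all hypothetical, made up
for illustration. In the kernel the lookup itself would more likely go
through of_machine_is_compatible() than through a hand-rolled loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical list of root-level compatibles for SoCs whose
 * interconnect masks the erratum. These strings are invented.
 */
static const char *const erratum_masked_socs[] = {
	"vendor-a,soc-x",
	"vendor-b,soc-y",
};

/*
 * Return true if any of the machine's root compatible strings appears
 * in the list above, i.e. the workaround could be skipped.
 */
static bool machine_erratum_masked(const char *const *root_compat, size_t n)
{
	const size_t nr_socs = sizeof(erratum_masked_socs) /
			       sizeof(erratum_masked_socs[0]);

	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < nr_socs; j++)
			if (strcmp(root_compat[i], erratum_masked_socs[j]) == 0)
				return true;

	/* Unknown machine: fail safe and keep the workaround enabled. */
	return false;
}
```

A board whose root node carries { "vendor-a,board-1", "vendor-a,soc-x" }
would match on the second entry. Anything not listed keeps the
workaround, which preserves the fail-safe default.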

> Adding a DT property to describe the hardware state of the erratum avoids the need for
> separate kernel builds to work around the missing firmware.

DT is not the kernel's runtime configuration mechanism. That assumes a
tight coupling of the DT and kernel. That may work for some cases like
Android with kernel and DT updated together, but for upstream we can't
assume that coupling and must treat it as an ABI (not an issue for
this case).

The kernel command line is a runtime config mechanism for the kernel.
So why not use only that? There's certainly some room for improvement
in how it is handled. For example, it's not well defined what happens to
'bootargs' as you go from a dtb to bootloader to kernel in terms
of overriding/prepending/appending, but that could be addressed.
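
As an illustration of what "well defined" could look like, here is a
small userspace sketch in which each boot stage states its policy
explicitly. The enum, the function name cmdline_merge() and the rule
that an empty addition leaves the existing string untouched are all
assumptions for this example, not existing kernel or bootloader
behaviour.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical policy a boot stage applies to the incoming bootargs. */
enum cmdline_policy {
	CMDLINE_OVERRIDE,	/* stage's string replaces the incoming one */
	CMDLINE_PREPEND,	/* stage's string goes first */
	CMDLINE_APPEND,		/* stage's string goes last */
};

/*
 * Combine the incoming command line 'base' with this stage's 'extra'
 * according to 'policy'. Returns 0 on success, -1 if the result does
 * not fit in 'out' (in which case 'out' holds a truncated string).
 */
static int cmdline_merge(char *out, size_t len, const char *base,
			 const char *extra, enum cmdline_policy policy)
{
	int n;

	if (extra[0] == '\0')
		n = snprintf(out, len, "%s", base);	/* nothing to add */
	else if (policy == CMDLINE_OVERRIDE || base[0] == '\0')
		n = snprintf(out, len, "%s", extra);
	else if (policy == CMDLINE_PREPEND)
		n = snprintf(out, len, "%s %s", extra, base);
	else
		n = snprintf(out, len, "%s %s", base, extra);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With base "console=ttyAMA0" and extra "quiet", CMDLINE_APPEND yields
"console=ttyAMA0 quiet", CMDLINE_PREPEND yields "quiet console=ttyAMA0",
and CMDLINE_OVERRIDE yields "quiet".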

> >> Instead of adding a device-tree entry for any CPU erratum that is
> >> relevant (or not) to the platform, allow the device-tree to describe
> >> firmware's responses for the SMCCC interface. This could be used as
> >> the data source for the firmware interface, or be parsed by the OS if
> >> the firmware interface is missing.
>
> > What's to prevent vendors from only using the DT mechanism and never
> > supporting the SMCCC interface? I'm sure some will love to not have to
> > make a firmware update when they can just fix it in DT.
>
> The firmware interface has to exist for ACPI systems where EL3 can't influence the ACPI
> tables, which typically get replaced if the part is OEM'd.
>
> Ultimately all the interface is describing is a non-discoverable hardware property
> relevant to an erratum. These are often configurations the silicon manufacturer chooses.
> In this case it's the behaviour of the interconnect.
>
> If we didn't have to support ACPI systems, this stuff would only have been in the DT. With

With...?

I fail to see what ACPI has to do with DT platforms adopting the SMCCC
interface or not.

> > The downside to this extendable binding compared to simple properties
> > is parsing a flat tree is slow and more complicated. So it may be
> > difficult to support if you need this early in boot.
>
> I do need this early in the boot, but I'm not worried about the delay as these tables
> should be small.
>
>
> >> Most errata can be detected from CPU id registers. These mechanisms
> >> are only needed for the rare cases that external knowledge is needed.
> >
> > And also have significant performance impact. In the end, how many
> > platforms are there that can't fix these in firmware and need a
> > mainline/distro kernel optimized to avoid the work-around. I suspect
> > the number is small enough it could be a list in the kernel.
>
> At a guess, it's all mobile phones produced in the last 2 years that want to run a version
> of Android that uses virtualisation. Cortex-A78 is affected, but I don't know when the
> first products were shipped.

How many run mainline and run it well enough to even care about the
optimization yet?

Even if we went with the above list, that's 2 years x 4 vendors (QCom,
Mediatek, Samsung, Google) x 1-2 SoCs (high and mid tier). Subtract out
any vendors capable of updating their firmware. So a worst case list
of potentially 8-16 SoCs? It shouldn't grow because anything newer is
going to implement the SMCCC interface, right? That's not unmanageable
in my book.

Rob