On Thu, 10 Aug 2023 02:07:10 PDT (-0700), Conor Dooley wrote:
On Wed, Aug 09, 2023 at 02:01:07AM -0700, Atish Kumar Patra wrote:
On Tue, Aug 8, 2023 at 6:39 AM Conor Dooley <conor@xxxxxxxxxx> wrote:
>
> On Tue, Aug 08, 2023 at 12:54:11AM -0700, Atish Kumar Patra wrote:
> > On Wed, Aug 2, 2023 at 4:14 AM Conor Dooley <conor.dooley@xxxxxxxxxxxxx> wrote:
> > >
> > > Add an erratum for versions [v0.8 to v1.3) of OpenSBI which fail to add
> > > the "no-map" property to the reserved memory nodes for the regions it
> > > has protected using PMPs.
> > >
> > > Our existing fix sweeping hibernation under the carpet by marking it
> > > NONPORTABLE is insufficient as there are other ways to generate
> > > accesses to these reserved memory regions, as Petr discovered [1]
> > > while testing crash kernels & kdump.
> > >
> > > Intercede during the boot process when the afflicted versions of OpenSBI
> > > are present & set the "no-map" property in all "mmode_resv" nodes before
> > > the kernel does its reserved memory region initialisation.
> > >
> >
> > We have different mechanisms of DT being passed to the kernel.
> >
> > 1. A prior stage(e.g U-Boot SPL) to M-mode runtime firmware (e.g.
> > OpenSBI, rustSBI) passes the DT to M-mode runtime firmware and it
> > passes to the next stage.
> > In this case, M-mode runtime firmware gets a chance to update the
> > no-map property in DT that the kernel can parse.
> >
> > 2. User loads the DT from the boot loader (e.g EDK2, U-Boot proper).
> > Any DT patching done by the M-mode firmware is useless. If these DTBs
> > don't have the no-map
> > property, hibernation or EFI booting will have issues as well.
> >
>
> > We are trying to solve only one part of problem #1 in this patch.
>
> Correct.
>
> If someone's second stage is also providing an incorrect devicetree
> then, yeah, this approach would fall apart - but it's the firmware
> provided devicetree being incorrect that I am trying to account for
> here. If a person incorrectly constructed one, I am not really sure what
> we can do for them, they incorrect described their hardware /shrug
> My patch should of course help in some of the scenarios you mention above
> if the name of the reserved memory region from OpenSBI is propagated by
> the second-stage bootloader, but that is just an extension of case 1,
> not case 2.
>
> > I
> > don't think any other M-mode runtime firmware patches DT with no-map
> > property as well.
> > Please let me know if I am wrong about that. The problem is not
> > restricted to [v0.8 to v1.3) of OpenSBI.
>
> It comes down to Alex's question - do we want to fix this kind of
> firmware issue in the kernel? Ultimately this is a policy decision that
> "somebody" has to make. Maybe the list of firmwares that need this
IMO, we shouldn't as this is a slippery slope. Kernel can't fix every
firmware bug by having erratas.
I agree with your point below about firmware in shipping products. I
am not aware of any official products shipping anything other than
OpenSBI either.
However, I have seen users using other firmwares in their dev
environment.
If someone's already changed their boards firmware, I have less sympathy
for them, as they should be able to make further changes. Punters buying
SBCs to install Fedora or Debian w/o having to consider their firmware
are who I am more interested in helping.
IMHO, this approach sets a bad precedent for the future especially
when it only solves one part of the problem.
Yeah, I'm certainly wary of setting an unwise precedent here.
Inevitably we will need to have firmware-related errata and it'd be good
to have a policy for what is (or more importantly what isn't
acceptable). Certainly we have said that known-broken version of OpenSBI
that T-Head puts in their SDK is not supported by the mainline kernel.
On the latter part, I'm perfectly happy to expand the erratum to cover
all affected firmwares, but I wasn't even sure if my fix worked
properly, hence the request for testing from those who encountered the
problem.
We shouldn't hide firmware bugs in the kernel when an upgraded
firmware is already available.
Just to note, availability of an updated firmware upstream does not
necessarily mean that corresponding update is possible for affected
hardware.
Yep. I think we're been in a very hobbist-centric world in RISC-V land,
but in general trying to get people to update firmware is hard. Part of
the whole "kernel updates don't break users" thing is what's underneath
the kernel, it's not just a uABI thing.
This bug is well documented in various threads and fixed in the latest
version of OpenSBI.
I am assuming other firmwares will follow it as well.
Anybody facing hibernation or efi related booting issues should just
upgrade to the latest version of firmware (e.g OpenSBI v1.3)
Latest version of Qemu will support(if not happened already) the
latest version of OpenSBI.
This issue will only manifest in kernels 6.4 or higher. Any user
facing these with the latest kernel can also upgrade the firmware.
Do you see any issue with that ?
I don't think it is fair to compare the ease of upgrading the kernel
to that required to upgrade a boards firmware, with the latter being
far, far more inconvenient on pretty much all of the boards that I have.
IMO we're in the same spot as every other port here, and generally they
work around firmware bugs when they've rolled out into production
somewhere that firmware updates aren't likely to happen quickly. I'm
not sure if there's any sort of exact rules written down anywhere, but
IMO if the bug is going to impact users then we should deal with it.
That applies for hardware bugs, but also firmware bugs (at a certain
point we won't be able to tell the difference). We're sort of doing
this with the misaligned access handling, for example.
I'm perfectly happy to drop this series though, if people generally are
of the opinion that this sort of firmware workaround is ill-advised.
We are unaffected by it, so I certainly have no pressure to have
something working here. It's my desire not to be user-hostile that
motivated this patch.
IIUC you guys and Reneas are the only ones who have hardware that might
be in a spot where users aren't able to update the firmware (ie, it's
out in production somewhere). So I'm adding Geert, though he probably
saw this months ago...
On that note: It's been ~4 months and it look like nobody's tested
anything (and the comments aren't really things that would preculde
testing). So maybe we just pick that second patch up into for-next and
see what happens? IIUC that will result in broken systems for users who
haven't updated their firmware.
I agree that's a user-hostile way to do things, which is generally a bad
way to go, but if it's really true that there's no users then we're
safe. Probably also worth calling it out on sw-dev just to be safe.