Re: [Patch] memory: tegra: Skip SID override from Guest VM

Marc Zyngier <maz@xxxxxxxxxx> · Wed, 07 Feb 2024 12:03:43 +0000

On Tue, 06 Feb 2024 17:08:42 +0000,
"Thierry Reding" <thierry.reding@xxxxxxxxx> wrote:
> 
> On Tue Feb 6, 2024 at 3:54 PM CET, Marc Zyngier wrote:
> > On Tue, 06 Feb 2024 14:07:10 +0000,
> > "Thierry Reding" <thierry.reding@xxxxxxxxx> wrote:
> > > 
> > > On Tue Feb 6, 2024 at 1:53 PM CET, Marc Zyngier wrote:
> > > > On Tue, 06 Feb 2024 12:28:27 +0000, Jon Hunter <jonathanh@xxxxxxxxxx> wrote:
> > > > > On 06/02/2024 12:17, Marc Zyngier wrote:
> > > [...]
> > > > > > - My own tegra186 HW doesn't have VHE, since it is ARMv8.0, and this
> > > > > >    helper will always return 'false'. How could this result in
> > > > > >    something that still works? Can I get a free CPU upgrade?
> > > > > 
> > > > > I thought this API just checks to see if we are in EL2?
> > > >
> > > > It does. And that's the problem. On ARMv8.0, we run the Linux kernel
> > > > at EL1. Tegra186 is ARMv8.0 (Denver + A57). So as written, this change
> > > > breaks the very platform it intends to support.
> > > 
> > > To clarify, the code that accesses these registers is shared across
> > > Tegra186 and later chips. Tegra194 and later do support ARMv8.1 VHE.
> >
> > But even on these machines that are VHE-capable, not running at EL2
> > doesn't mean we're running as a guest. The user can keep the kernel
> > at EL1 with a command-line option such as kvm-arm.mode=nvhe, in
> > which case the kernel stays at EL1 while KVM is deployed at EL2.
> >
> > > Granted, if it always returns false on Tegra186 that's not what we
> > > want.
> >
> > I'm glad we agree here.
> >
> > > > > > - If you assign this device to a VM and the hypervisor doesn't
> > > > > >    correctly virtualise it, then it is a different device and you
> > > > > >    should simply advertise it as something else. Or even better, fix
> > > > > >    your hypervisor.
> > > > > 
> > > > > Sumit can add some more details on why we don't completely disable the
> > > > > device for guest OSs.
> > > >
> > > > It's not about disabling it. It is about correctly supporting it
> > > > (providing full emulation for it), or advertising it as something
> > > > different so that SW can handle it differently.
> > > 
> > > It's really not a different device. It's exactly the same device except
> > > that accessing some registers isn't permitted. We also can't easily
> > > remove parts of the register region from device tree because these are
> > > intermixed with other registers that we do want access to.
> >
> > But that's the definition of being a different device. It has a
> > different programming interface, hence it is different. The fact that
> > it is the same HW block mediated by a hypervisor doesn't really change
> > that.
> 
> The programming model isn't really different in these cases, but rather
> restricted. I think a compatible string is a suboptimal way to describe
> this.

It *is* different. If it wasn't different, you wouldn't need this
patch. I'm puzzled that we have to argue on *that*. You can call it
restricted, I call it broken. In both cases, it is a *different*
programming interface as you can't use existing SW for it.
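To make the EL2 argument from earlier in the thread concrete, here is a minimal userspace model (not kernel code) of the four configurations discussed: a Tegra186 host on ARMv8.0, a VHE-capable Tegra194-or-later host, a VHE-capable host booted with kvm-arm.mode=nvhe, and an actual guest. The BootConfig type, the configuration labels and the helper names are hypothetical; the only thing modelled is the patch's assumption that "not running at EL2" means "running as a guest".

```python
from dataclasses import dataclass

# Hypothetical model, not kernel code. It only captures the assumption
# "kernel not at EL2 => running as a guest" made by the patch's EL2 check.


@dataclass(frozen=True)
class BootConfig:
    name: str
    kernel_at_el2: bool  # what an "are we at EL2?" check would report
    is_guest: bool       # what is actually true


CONFIGS = [
    BootConfig("Tegra186 host (ARMv8.0, no VHE)", kernel_at_el2=False, is_guest=False),
    BootConfig("Tegra194+ host (VHE)", kernel_at_el2=True, is_guest=False),
    BootConfig("VHE-capable host, kvm-arm.mode=nvhe", kernel_at_el2=False, is_guest=False),
    BootConfig("guest VM", kernel_at_el2=False, is_guest=True),
]


def heuristic_says_guest(cfg: BootConfig) -> bool:
    """The questioned assumption: not running at EL2 means running as a guest."""
    return not cfg.kernel_at_el2


def misclassified(configs):
    """Return the names of the configurations the heuristic gets wrong."""
    return [c.name for c in configs if heuristic_says_guest(c) != c.is_guest]


if __name__ == "__main__":
    for name in misclassified(CONFIGS):
        print("wrongly treated as a guest:", name)
```

Only the real guest and the VHE host are classified correctly; both host configurations that keep the kernel at EL1 (the ARMv8.0 Tegra186 and the nvhe boot) are reported as guests, so the driver would skip the SID override on exactly the platforms that need it.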

> 
> > > > Poking into the internals of how the kernel is booted for a driver
> > > > that isn't tied to the core architecture (because it would need to
> > > > access system registers, for example) is not an acceptable outcome.
> > > 
> > > So what would be the better option? Use a different compatible string to
> > > make the driver handle the device differently? Or add a custom
> > > property to the device tree node to mark this as running in a
> > > virtualized environment?
> >
> > A different compatible string would be my preferred option. An extra
> > property would work as well. As far as I am concerned, these two
> > options are the right way to express the fact that you have something
> > that isn't quite like the real thing.
> 
> Coincidentally there's another discussion with a lot of similarities
> regarding simulated platforms. For these it's usually less about the
> register set being restricted and more about quirks that are needed
> in simulation but will not be necessary on silicon.
> 
> This could be a timeout that's longer in simulation, or it could be
> certain programming that would be needed in silicon but isn't necessary
> or functional in simulation (think I/O calibration, that sort of thing).
> One could argue that these are also different devices when in simulation
> but they really aren't. They're more like an approximation of the actual
> device that will be in silicon chips.

Simulation/DV environments are a very different kettle of fish. You
generally treat passing time with a scaling factor, and you are likely
to run a very hacked-up SW stack anyway.

In any case, this is not relevant to upstream stuff, unless you plan
to ship your emulation environment.

> Another problem that both of the cases have in common is that they are
> parameters that usually apply to the entire system. For some devices it
> is easier to parameterize via DT (for example for certain devices we
> have bindings with special register regions that are only available in
> host OS mode), but for others this may not be true. Adding extra
> compatible strings for virtualization/simulation is going to get quite
> complex very quickly if we need to differentiate between all of these
> scenarios.

That's the price you pay for these inconsistencies. If your "HW" has a
lot of variability and you can't discover its capabilities from SW,
then it is either badly designed, badly implemented, badly emulated,
or any combination thereof.

In any case, you get to keep the pieces.

> 
> > > Perhaps we can reuse the top-level hypervisor node? That seems to only
> > > ever have been used for Xen on 32-bit ARM, so not sure if that'd still
> > > be appropriate.
> >
> > I'd shy away from this. You would be deriving properties from a
> > hypervisor implementation, instead of expressing those properties
> > directly. In my experience, the direct method is always preferable.
> 
> I would generally agree. However, I think the compatible string
> solution in particular could turn very ugly for this. If we express
> these properties via compatible strings we may very well end up with
> many different compatible strings to cover all cases.
> 
> Say you've got one hypervisor that changes the programming model in a
> certain way and a second hypervisor that constrains in a different way.
> Do we now need one compatible string for each hypervisor? Do we add
> compatible strings for each restriction and have potentially very long
> compatible string lists? Separate properties would work slightly better
> for this.

Again, the job of a hypervisor is to offer an architecturally correct
view of some HW, emulated or not. If your hypervisors are implementing
a large variety of diverging behaviours, SW needs to be able to
distinguish between those. You can either add properties, use compat
strings, or use a discovery protocol implemented by the device.

In any case, each deviation needs to be uniquely identifiable, and be
described either in FW or by the device itself, if only because Linux
isn't the only game in town.
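As a sketch of what "uniquely identifiable and described in FW" could look like on the driver side, the following Python model (again not kernel code) mimics a compatible-to-match-data table in the spirit of struct of_device_id, plus an optional boolean property. "nvidia,tegra186-mc" is the existing memory controller compatible; the "nvidia,tegra186-mc-guest" compatible, the "nvidia,no-sid-override" property and the McSocData type are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical model, not kernel code. "nvidia,tegra186-mc" is the existing
# compatible; "nvidia,tegra186-mc-guest" and "nvidia,no-sid-override" are
# invented names that illustrate the two options discussed in this thread.


@dataclass(frozen=True)
class McSocData:
    program_sid_override: bool


# Compatible string -> per-variant match data, in the spirit of of_device_id.
MATCH_TABLE = {
    "nvidia,tegra186-mc": McSocData(program_sid_override=True),
    # Option 1: a distinct (hypothetical) compatible for the restricted device.
    "nvidia,tegra186-mc-guest": McSocData(program_sid_override=False),
}


def match(compatibles: List[str]) -> Optional[McSocData]:
    """Return match data for the first (most specific) compatible we know."""
    for compat in compatibles:
        if compat in MATCH_TABLE:
            return MATCH_TABLE[compat]
    return None


def should_program_sid_override(node: dict) -> bool:
    """Decide purely from the DT description, never from the exception level."""
    data = match(node["compatible"])
    if data is None:
        raise ValueError("no matching compatible string")
    # Option 2: a (hypothetical) boolean property on an otherwise unchanged node.
    if node.get("properties", {}).get("nvidia,no-sid-override", False):
        return False
    return data.program_sid_override


if __name__ == "__main__":
    host = {"compatible": ["nvidia,tegra186-mc"]}
    guest = {"compatible": ["nvidia,tegra186-mc-guest"]}
    flagged = {"compatible": ["nvidia,tegra186-mc"],
               "properties": {"nvidia,no-sid-override": True}}
    for label, node in (("host", host), ("guest", guest), ("flagged", flagged)):
        print(label, should_program_sid_override(node))
```

The restricted node in this sketch deliberately does not list "nvidia,tegra186-mc" as a fallback: an existing driver binding to the fallback would go on to program registers the guest cannot access, which is the incompatibility argued earlier in this mail.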

> There are some cases where we can use register contents to determine
> what the OS is allowed to do, but these registers don't exist for all HW
> blocks. We may be able to get more added to new chips, but we obviously
> can't retroactively add them for existing ones.
> 
> A central node or property would at least allow broad parameterization.
> I would hope that at least hypervisor implementations don't vary too
> much in terms of what they restrict and what they don't, so perhaps it
> wouldn't be that bad. Perhaps that's also overly optimistic.

Top level properties are no good unless what they express is forever
immutable and described upfront. Identifying a hypervisor doesn't do
that, and most of the time there will be all sorts of *variable*
properties that need to be further discovered by one mechanism or
another. In my (surely very limited) experience of writing hypervisors
over the years, this eventually becomes an unmaintainable mess.

You are of course free to do that in the drivers you maintain as long
as you don't break my own toys, but I'd urge you to reconsider this
and explore other possibilities.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.