Re: [PATCH v2 6/8] fwctl: Add documentation

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 8 Aug 2024 09:24:13 -0300

On Tue, Aug 06, 2024 at 10:03:36AM +0200, Daniel Vetter wrote:
> On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> > Document the purpose and rules for the fwctl subsystem.
> > 
> > Link in kdocs to the doc tree.
> > 
> > Nacked-by: Jakub Kicinski <kuba@xxxxxxxxxx>
> > Link: https://lore.kernel.org/r/20240603114250.5325279c@xxxxxxxxxx
> > Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
> 
> I think we'll need something like fwctl rather sooner than later for gpus
> too, so for all the fwctl patches up to this one:

Yes, I think so as well. You can see it already in the various GPU out
of tree stuff where there is usually some expansive
monitoring/debug/provisioning tool there too.

> Acked-by: Daniel Vetter <daniel.vetter@xxxxxxxx>

Thanks!

> I both really liked the approach to fwctl_unregister and
> copy_struct_from_user, and didn't spot anything offensive in the code.

It is copied from iommufd which copied concepts from the fixedup modern
rdma :) Proven over a few years now.

> > +Further, within the function, devices often have RPC commands that fall within
> > +some general scopes of action:
> 
> In your fwctl_rpc_scope you only have 4, not 5, and I think that's
> clearer. Might also be good to put a kerneldoc link to the enum in here
> for the details.

I bundled these two together in the enum FWCTL_RPC_CONFIGURATION:

> > + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> > +    function reset.
> > +
> > + 2. Access to function & child runtime configuration that kernel drivers can
> > +    discover at runtime.
> 
> This one worries me, since it has potential for people to get it very
> wrong and e.g. expose configuration where it's only safe if the driver
> isn't bound, or at least no userspace is using it. 

The notion was "at runtime" meaning any active user of the device
would either not be aware of whatever change or already have some way
to learn about it.

Especially when we think about child configuration - ie configuration
of a VF from the hypervisor while a VM is running - there can be
useful things that fit under this category. For instance you might
throttle the device to support live migration.

Throttling is a really complex topic for an autonomous device like
GPU/RDMA.

> I'd drop this and just
> leave the one configuration rpc, with maybe more detail what exactly "When
> configuration is written to the device remains in a fully supported
> state." 

Ah, this language is incorporating the distro feedback around loosing
the ability to support the system.

> from the kerneldoc means. I think only safe options woulb be a)
> applied on function reset b) transparent to both kernel and userspace
> (beyond maybe device performance).

Let's lean into transparent more:

 1. Access to function & child configuration, FLASH, etc. that becomes live at a
    function reset. Access to function & child runtime configuration that is
    transparent or non-disruptive to any driver or VM.

> That might not cut all configuration items, but for those I think it'd be
> best if fwctl guarantees through the driver model that there's no driver
> bound to that function (or used by vfio/kvm), to guarantee safety. So
> explicitly split them out as runtime configuration with a distinct rpc
> scope. Maybe an addition for later.

No driver bound is too strong. VFIO and even mlx5_core can trigger a
FLR while still bound in a controlled way. Taking effect at FLR time
is a reasonable restriction.

Like if you reconfigure a child VF and then start a VM there will be
an automatic FLR in that process that can make the updated VF
configuration active.

> > + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> > +    driver can react to the device configuration at function reset/driver load
> > +    time, but otherwise should not be coupled to fwctl.
> 
> Kinda the same worry as above, I think this should be "... but otherwise
> _must_ not be coupled to fwctl".

Yep

> > + 4. Operate the HW in a way that overlaps with the core purpose of another
> > +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> > +    network packets, or operate an accelerator's data plane.
> 
> I still think some words about what to do when fwctl exposes some
> functional which later on is covered by a newly added subsystem that
> didn't yet exist. Also maybe adding some more examples from the RAS side
> of things, since that's come up a few time in the ksummit-discuss thread,
> plus I think it's where we'll most likely have a new subsystem or extended
> functionality of an existing one pop up and cause conflicts with fwctl
> rpcs that have landed earlier.

How about:

Operations exposed through fwctl's non-taining interfaces should be fully
sharable with other users of the device. For instance exposing a RPC through
fwctl should never prevent a kernel subsystem from also concurrently using that
same RPC or hardware unit down the road. In such cases fwctl will be less
important than proper kernel subsystems that eventually emerge. Mistakes in this
area resulting in clashes will be resolved in favour of a kernel implementation.

> > +fwctl Driver design
> > +-------------------
> > +
> > +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> > +device possibly using the auxiliary_device mechanism. In that case several
> > +subsystems are going to be sharing the same device and FW interface layer so the
> > +device design must already provide for isolation and cooperation between kernel
> > +subsystems. fwctl should fit into that same model.
> > +
> > +Part of the driver should include a description of how its scope restrictions
> > +and security model work. The driver and FW together must ensure that RPCs
> > +provided by user space are mapped to the appropriate scope. If the validation is
> > +done in the driver then the validation can read a 'command effects' report from
> > +the device, or hardwire the enforcement. If the validation is done in the FW,
> > +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> > +
> > +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> > +any FW resources, or any resources it does allocate are freed on FD closure.  A
> > +driver primarily constructed around FW RPCs may find that its core PCI function
> > +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> > +subsystems.
> > +
> > +Each device type must represent a stable FW ABI, such that the userspace
> > +components have the same general stability we expect from the kernel. FW upgrade
> > +should not break the userspace tools.
> 
> I think an exception for the debug rpcs (or maybe only
> FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
> for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
> to remove outdated ones. 

I'm definitely mindful of Linus's pragmatic view of ABI stability, where
existing tools that *people actually use* shouldn't break. You
shouldn't have to upgrade your tools when you upgrade your FW.

I think it is important to convey that as a the gold standard here
too.

> And especially for field and even more in-house debug tooling, you really
> want the userspace version matching your fw anyway.

Yes, this is definately the case. But those tools are also private and
I think fall under the *people actually use* exception.

So, lets try again:

Each device type must be mindful of Linux's philosophy for stable ABI. The FW
RPC interface does not have to meet a strictly stable ABI, but it does need to
meet an expectation that userspace tools that are deployed and in significant
use don't needlessly break. FW upgrade and kernel upgrade should keep widely
deployed tooling working.

Development and debugging focused RPCs under more permissive scopes can have
less stablitiy if the tools using them are only run under exceptional
circumstances and not for every day use of the device. Debugging tools may even
require exact version matching as they may require something similar to DWARF
debug information from the FW binary.

> Currently that mess tends to leave in debugfs and/or out-of-tree, so
> there's no stable uapi guarantee anyway. And I don't see the point in
> requiring it - if there is a need for stabling tooling, maybe that
> indicates more the need for a new subsystem that standardized things
> across devices/vendors.

Yes, fwctl needs to have both. I would expect things like the
configuration to have a fairly stable ABI. Maybe the list of
configurables will change but access to them should be ABI stable.

> > +Some examples:
> > +
> > + - HW RAID controllers. This includes RPCs to do things like compose drives into
> > +   a RAID volume, configure RAID parameters, monitor the HW and more.
> > +
> > + - Baseboard managers. RPCs for configuring settings in the device and more
> > +
> > + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> > +   functions that different products have defined, but more exists.
> > +
> > + - CXL also has a NVMe-like vendor command system.
> > +
> > + - DRM allows user space drivers to send commands to the device via kernel
> > +   mediation
> > +
> > + - RDMA allows user space drivers to directly push commands to the device
> > +   without kernel involvement
> > +
> > + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> > +
> > +The first 4 are examples of areas that fwctl intends to cover.
> 
> Maybe add a sentence here why the latter 3 aren't, just to strengthen that
> point?

How about:

The first 4 are examples of areas that fwctl intends to cover. The latter three
are examples of denied behavior as they fully overlap with the primary purpose
of a kernel subsystem.

Thanks
Jason