Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture

Adam Manzanares <a.manzanares@xxxxxxxxxxx> · Wed, 8 Feb 2023 16:38:53 +0000

On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> On Wed, 1 Feb 2023 12:04:56 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> 
> > Hi Jonathan,
> > 
> > > On Jan 31, 2023, at 9:41 AM, Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > > 
> > > On Mon, 30 Jan 2023 11:11:23 -0800
> > > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> > >   
> > >> Hello,  
> > > 
> > > Hi Slava,
> > > 
> > > I'll throw some opinions at this :)
> > >   
> > >> 
> > >> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> > >> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> > >> of CXL hardware features. FM daemon receives requests from configuration tool and
> > >> executes commands by means of interaction with kernel-space subsystem and CXL switch
> > >> (that can be emulated by QEMU). So, the key questions for discussion:  
> > > 
> > > Worth describing operating modes to be supported: You kind of cover this later
> > > but I think pulling it out makes it clearer that we want one bit of software to
> > > do several different things.
> > > 
> > > 1) FM separate from hosts and talked to by higher level orchestration software
> > >   but using a Switch CCI or MHD mailbox (over PCI)
> > >   This one is fairly easy because any security / shooting self in foot problems
> > >   are an issue for higher level software. 
> > > 2) FM on host.  Probably mostly going to be relevant for debug but may use
> > >   the same mailbox as is being used by the existing CXL drivers (for Multi
> > >   Head Device it might be the end point mailbox, for Multi Logical Device
> > >   behind a switch it might be the switch mailbox).
> > > 3) All out of band (MCTP or similar - want some shared code, but no
> > >   need for anything in kernel as far as I can tell).
> > >   
> > 
> > Most probably, we will have multiple FM implementations in firmware.
> > Yes, FM on host could be important for debug and to verify correctness of
> > firmware-based implementations. But FM daemon on host could be important
> > to receive notifications and react somehow on these events. Also, journalling
> > of events/messages could be important responsibility of FM daemon
> > on host. 
> 
> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> that also has the lower level FM-API access).  I think it is somewhat
> separate from the rest of this on basis it may well just be talking redfish
> to the FM and there are lots of tools for that sort of handling already.
> 

I would be interested in participating in a BoF about this topic. I wonder what
happens when we have multiple switches with multiple FMs, each on a separate BMC.
In this case, does it make more sense for the owner of the global FM state to
be a user space application? Is this the job of the orchestrator?

The BMC based FM seems to have scalability issues, but will we hit them in
practice any time soon?
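
To make the question concrete, here is a minimal Python sketch of what a user
space owner of the global FM state could look like when each switch BMC runs
its own FM. Everything in it is an assumption for illustration: FmView,
GlobalFmState and the ID strings are hypothetical names, and nothing here
reflects the FM-API or any Redfish schema.

```python
from dataclasses import dataclass, field


@dataclass
class FmView:
    """State reported by one FM instance (hypothetical; one per switch BMC)."""
    fm_id: str
    # Maps a logical device ID to the host it is bound to (None if unbound).
    bindings: dict = field(default_factory=dict)


class GlobalFmState:
    """User space owner of the merged view across all FM instances."""

    def __init__(self):
        self._views = {}

    def update(self, view):
        # Replace the last known state for this FM with the new report.
        self._views[view.fm_id] = view

    def bindings_for_host(self, host):
        # Collect (fm_id, ld_id) pairs bound to the host across every FM.
        return sorted(
            (fm_id, ld_id)
            for fm_id, view in self._views.items()
            for ld_id, bound in view.bindings.items()
            if bound == host
        )


state = GlobalFmState()
state.update(FmView("fm0", {"ld0": "hostA", "ld1": None}))
state.update(FmView("fm1", {"ld0": "hostA", "ld1": "hostB"}))
print(state.bindings_for_host("hostA"))
```

Note the key is the (FM, logical device) pair rather than the device ID alone,
since I would not assume IDs are unique across independent FMs.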

> > 
> > >   
> > >> (1) How to distribute functionality between user-space and kernel-space?  
> > > 
> > > Kernel for transport if mailbox based (switch or MHD).
> > > Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> > > and host to switch to Multi Logical Device - Logical Device tunneling
> > > but that could also be left to userspace.
> > >   
> > 
> > People love to move everything to user-space now. But I believe we could have
> > both kernel-space and user-space solutions. I think we need to check which way
> > could be the more efficient and elegant solution.
> 
> Agreed - though I think we need to remember running this on the host that is
> using the devices isn't likely to be a common actual usecase.  So we should
> design for that to 'work' but not to be the assumed method. Hence if any
> sync type activity is needed it might be a case of don't do the wrong thing
> rather than hard protections.
> 
> > 
> > > If MCTP use the existing MCTP framework which is underlying transport independent.
> > > I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> > > and some emulation) In the cover letter of the emulation PoC
> > >   
> > 
> > Sounds interesting. Let me check it. But I believe it might not be the first task
> > in this implementation. :)
> 
> Some level of MCTP support needs to be early enough that we don't get
> any design decisions wrong.  For MCTP I think the vast majority of handling
> has to be in userspace. I don't want to end up with duplication because we did
> some of that down in the kernel for the mailbox solution.
> 
> > 
> > > 
> > > I think everything else belongs in userspace. I believe there are redfish APIs
> > > etc that would then be used to query and drive the userspace program from an
> > > orchestrator or similar level software.
> > >   
> > 
> > I need to check the redfish API. It sounds reasonable to employ some existing
> > framework.
> > 
> > >> (2) Which functionality kernel-space needs to provide for implementation FM features?
> > >>      Which kernel-space functionality do we need to implement yet?  
> > > 
> > > Very little needed if we just expose the transport via PCI mailboxes.
> > > There is a possible concern that FM-API commands are frequently
> > > destructive and currently we don't let userspace poke destructive
> > > commands. That may just need a specific opt in to say we know we
> > > can shoot ourselves in the foot.
> > >   
> > 
> > I think this is why we need kernel. It sounds for me that we have to have user-space
> > and kernel-space collaboration here.
> 
> I think it will be lightweight and looks like the existing CXL mailbox userspace
> interface (some commands are the same).
> 
> > 
> > >> (3) Do we need MCTP (Management Component Transport Protocol) or some other
> > >>      protocol can be used for interaction between configuration tool, FM daemon, and
> > >>      CXL switch?  
> > > 
> > > Yes MCTP is needed.
> > > I don't think we want the actual management code to be different
> > > depending on transport / protocol.  However we might layer it so that there
> > > is an interface program that sits between the management library / program and
> > > the FM-API transport.
> > > 
> > > Note I was struggling to find a suitable MCTP interface to emulate - so would
> > > welcome suggestions on that.  I hacked the above PoC using an aspeed i2c
> > > controller that supported the right magic combination of features needed
> > > for MCTP over I2C but it doesn't have ACPI support which rather limits
> > > usage (and I doubt anyone will be keen on adding ACPI support just to
> > > test CXL related code :)  If anyone knows of a suitable MCTP host we
> > > could use for this that would be great (MCTP over PCI VDM might be nice for
> > > example)
> > >   
> > 
> > Let us start some command/feature implementation and we will figure it out.
> > But, I assume we need to start from something like CXL devices discovery at first.
> 
> Sure - some of the kernel side of that was present in the switch-cci mailbox PoC
> Obviously tooling was a test hack though ;)
> 
> > 
> > >> (4) What architecture FM implementation requires?
> > >> (5) Does it make sense to use Rust as implementation language?  
> > > 
> > > Take your pick ;) First person to write a lot of code gets to pick the language.
> > >   
> > 
> > Yeah, I see the point. Rust can provide some benefits (memory safety model, for example).
> > But it could introduce some issue with collaboration and makes implementation more
> > slow. Everybody develops in C language. But switching on Rust could be not so easy
> > target.
> > 
> > <skipped>
> > 
> > >> 
> > >> 
> > >> FM configuration tool requires such commands:  
> > > 
> > > A command line tool is fine, but likely the 'real' FM configuration interface will be via
> > > a protocol (e.g. redfish).
> > > There is a WIP for CXL, though not sure on latest status on this (document on there is from
> > > 2021)
> > > 
> > > So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtool
> > > that just makes it a bit easier to poke with common commands.
> > > 
> > > I'm far from an expert of redfish so may have this all wrong.
> > >   
> > 
> > Sounds reasonable to me. Let me check how good it could be for this project.
> > 
> > >> 
> > >> Discover - discover available agents
> > >> Subcommands:
> > >>    - fm_cli discover fm - discover FM instances  
> > > 
> > > If we are allowing more than one FM then I'd expect all the
> > > other commands to be directed at that by some sort of FM specific
> > > ID. If only one, what does this command do that isn't better
> > > done with fm get_info
> > >   
> > 
> > Yes, we need to identify every object somehow. And it’s interesting point.
> > From point of view, some human-friendly names could be good.
> > But firmware-based FM implementation needs to follow the same rules.
> > And it sounds for me that CXL specification should define how CXL FM or
> > CXL device identify itself. Anyway, we need to ask CXL device and it should
> > return to us some ID. Probably, it will be some GUID or likewise number.
> > 
> > >   
> > >>    - fm_cli discover cxl_devices - discover CXL devices
> > >>    - fm_cli discover logical_devices - discover logical devices  
> > > 
> > > Discover switches as well.
> > >   
> > 
> > I assumed that CXL switch is a subclass of CXL devices. Do you mean that
> > it is independent case?
> 
> Maybe simpler broken out. What you do with a switch is often very different
> from type 3 devices.
> 
> > 
> > >> 
> > >> FM - manage Fabric Manager
> > >> Subcommands:
> > >>    - fm_cli fm get_info - get FM status/info
> > >>    - fm_cli fm start - start FM instance
> > >>    - fm_cli fm restart - restart FM instance
> > >>    - fm_cli fm stop - stop FM instance
> > >>    - fm_cli fm get_config - get FM configuration
> > >>    - fm_cli fm set_config - set FM configuration  
> > > 
> > > I'd keep this slim for now.  No idea what FM config we might want to
> > > set so don't bother listing command yet.
> > >   
> > 
> > Yeah, it’s not completely clear yet. But I assume we can consider such
> > configuration options like:
> > (1) register to receive event notifications
> > (2) logging of events
> > (3) errors handling
> > 
> > >>    - fm_cli fm get_events - get event records  
> > > Not sure what FM would have in the way of events (as opposed to
> > > things it is talking to).
> > >   
> > 
> > I think FM can log events. If we consider FM daemon on host, then it
> > could issue messages to end user as reaction to some events.
> > 
> > >> 
> > >> Switch - manage CXL switch
> > >> Subcommands:
> > >>    - fm_cli switch get_info - get CXL switch info/status  
> > > 
> > > These all need an ID field of some type to identify which switch.
> > >   
> > 
> > Yeah, it is exactly what we need for every command. We need to identify
> > an object for a request.
> > 
> > >>    - fm_cli switch get_config - get switch configuration
> > >>    - fm_cli switch set_config - set switch configuration  
> > 
> > <skipped>
> > 
> > >> 
> > >> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> > >> Subcommands:
> > >>    - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
> > >>         total Dynamic Capacity of the device, and supported region configurations)
> > >>    - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
> > >>         (retrieves the Dynamic Capacity configuration for a specified host)
> > >>    - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
> > >>         (sets the configuration of a DC Region)
> > >>    - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
> > >>         Extent List for a specified host)
> > >>    - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
> > >>         Dynamic Capacity to the specified region on a host)  
> > > 
> > > That one is complex ;) Probably needs a whole man page to itself.
> > >   
> > 
> > Currently, it’s only declaration of command set. Yeah, implementation will be complex. :)
> > 
> > >>    - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
> > >>         Dynamic Capacity from a host)
> > >> 
> > >> FM daemon receives requests from configuration tool and executes commands by means of
> > >> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
> > >>    - Execute configuration tool commands
> > >>    - Manage hot-add and hot-removal of devices  
> > > 
> > > In what sense?  I'd expect it to notify some higher level entity
> > > (orchestrator or similar) but not sure I see what management the
> > > FM would do.  
> > >   
> > 
> > I assume that if FM manages some metadata, then hot-add or hot-removal could
> > require some metadata corrections. Also, hot-add and hot-removal can generate some
> > events that FM can receive and process somehow. For example, it is possible to log
> > event messages into some journal.
> 
> Ok. Potentially stuff there - though exactly which layer ends up managing this
> stuff isn't obvious to me yet.
> 
> > 
> > >>    - Manage surprise removal of devices  
> > > 
> > > Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> > > what to do in the way of managing this.  Scream loudly?
> > >   
> > 
> > Maybe, it could require application(s) notification. Let’s imagine that application
> > uses some resources from removed device. Maybe, FM can manage kernel-space
> > metadata correction and helping to manage application requests to not existing
> > entities.
> 
> Notifications for the host are likely to come via inband means - so type3 driver
> handling rather than related to FM.  As far as the host is concerned this is the
> same as case where there is no FM and someone ripped a device out.
> 
> There might indeed be meta data to manage, but doubt it will have anything to
> do with kernel.
> 

I've also had similar thoughts. I think the OS responds to notifications that
are generated in-band after changes to the state of the FM are made through
OOB means.

I envision the host sending Redfish requests to a switch BMC that has an FM
implementation. Once the FM has implemented the changes, they would show up
as changes to the PCIe hierarchy on a host, which is capable of responding to
such changes.
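
A rough sketch of that flow in Python, with the caveat that every name in it
(SwitchFm, Host, handle_request, the "bind"/"unbind" actions) is made up for
illustration, and the request is only a stand-in for a Redfish call rather
than the real schema:

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Host that only learns about changes through in-band events."""
    name: str
    visible_devices: set = field(default_factory=set)

    def on_hotplug(self, device, added):
        # Stand-in for the OS reacting to a PCIe hierarchy change.
        if added:
            self.visible_devices.add(device)
        else:
            self.visible_devices.discard(device)


class SwitchFm:
    """FM on a switch BMC; takes OOB requests and changes bindings."""

    def __init__(self):
        self._hosts = {}
        self._owner = {}  # device -> name of the host it is bound to

    def attach_host(self, host):
        self._hosts[host.name] = host

    def handle_request(self, action, device, host_name):
        host = self._hosts[host_name]  # KeyError if the host is unknown
        if action == "bind":
            if device in self._owner:
                return False  # already bound; refuse rather than steal
            self._owner[device] = host_name
            host.on_hotplug(device, added=True)
            return True
        if action == "unbind":
            if self._owner.get(device) != host_name:
                return False
            del self._owner[device]
            host.on_hotplug(device, added=False)
            return True
        raise ValueError(f"unknown action: {action}")
```

The host never talks to the FM about the result here; it only sees the
hot-add or hot-remove, the same as if there were no FM at all.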

> > 
> > >>    - Receive and handle event notifications from the CXL switch
> > >>    - Logging events
> > >>    - Memory allocation and QoS Telemetry management
> > >>    - Error/Failure handling  
> > > 
> > > I'm not sure on separation of role between this component and
> > > higher level policy / admin driven software.
> > > 
> > > For memory allocation it might take a 'give host A this much
> > > memory with this characteristic set' command and own the
> > > allocations across all present devices, or it might just
> > > act as an interface layer to higher level software that does
> > > the fine detail of figuring out which device to allocate memory
> > > from to satisfy such a request.
> > > 
> > > Whilst I agree having a broad vision for an interface is good
> > > there are a lot of subtle details in some of these commands
> > > so I'd not spend too long refining the whole lot. Probably better
> > > to look at them one at a time and then just have whoever ends
> > > up maintaining / reviewing this thing responsible for making sure the
> > > parameter format etc is consistent across commands.
> > >   
> > 
> > Yes, I agree. Let’s do it step by step. I believe we need to start from
> > implementation the application that process commands and do nothing
> > at first. And first command that needs to be implemented is a discovery
> > of CXL devices, switches, and FM instances because we need to identify
> > CXL object somehow for any other command.
> 
> Agreed discovery of devices and capabilities is definitely where to start
> + I think presenting that as a redfish model.
> 
> Jonathan
> 
> > 
> > Thanks,
> > Slava.
> > 
>
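
On starting with an application that parses commands and does nothing, plus
discovery and a per-object ID: a minimal sketch of such a skeleton, using
Python's argparse. The subcommand names follow the list proposed earlier in
the thread; the "switches" discover target and the --id option are my
assumptions, and dispatch() only returns a description of the request since
there is no backend to talk to yet.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="fm_cli")
    sub = parser.add_subparsers(dest="group", required=True)

    discover = sub.add_parser("discover", help="discover available agents")
    discover.add_argument(
        "target", choices=["fm", "switches", "cxl_devices", "logical_devices"]
    )

    switch = sub.add_parser("switch", help="manage CXL switch")
    switch.add_argument("command", choices=["get_info", "get_config", "set_config"])
    # Hypothetical option: every switch command names the switch it targets.
    switch.add_argument("--id", required=True, help="ID of the switch to address")
    return parser


def dispatch(argv):
    args = build_parser().parse_args(argv)
    # No backend yet: describe the request instead of sending it anywhere.
    request = {"group": args.group}
    if args.group == "discover":
        request["target"] = args.target
    else:
        request["command"] = args.command
        request["id"] = args.id
    return request


print(dispatch(["discover", "fm"]))
print(dispatch(["switch", "get_info", "--id", "sw0"]))
```

Whatever ID format discovery ends up returning would be what gets passed back
in through --id, so the two pieces can be developed against each other.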