Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Thu, 9 Feb 2023 11:05:02 +0000

On Wed, 8 Feb 2023 10:03:57 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:

> > On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@xxxxxxxxxxx> wrote:
> > 
> > On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:  
> >> On Wed, 1 Feb 2023 12:04:56 -0800
> >> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >>   
> >>>>   
> 
> <skipped>
> 
> >>> 
> >>> Most probably, we will have multiple FM implementations in firmware.
> >>> Yes, FM on host could be important for debug and to verify correctness
> >>> firmware-based implementations. But FM daemon on host could be important
> >>> to receive notifications and react somehow on these events. Also, journalling
> >>> of events/messages/events could be important responsibility of FM daemon
> >>> on host.   
> >> 
> >> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >> that also has the lower level FM-API access).  I think it is somewhat
> >> separate from the rest of this on basis it may well just be talking redfish
> >> to the FM and there are lots of tools for that sort of handling already.
> >>   
> > 
> > I would be interested in participating in a BOF about this topic. I wonder what
> > happens when we have multiple switches with multiple FMs each on a separate BMC.
> > In this case, does it make more sense to have an owner of the global FM state 
> > be a user space application. Is this the job of the orchestrator?

This partly comes down to terminology. Ultimately there is an FM that is
responsible for the whole fabric (could be distributed software) and that
in turn will talk to the various BMCs that then talk to the switches.

Depending on the setup it may not be necessary for any entity to see the
whole fabric.

Interesting point in general though.  I think it boils down to getting the
layering in any software correct, and that is easier done from the outset.

I don't know whether the Redfish stuff is flexible enough to cover this, but
if it is, I'd envision the actual FM talking Redfish to a bunch of sub-FMs
and in turn presenting Redfish to the orchestrator.

Any of these components might run on separate machines, or in firmware on
some device, or indeed all run on one server that is acting as the FM and
a node in the orchestrator layer.
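
As a rough sketch of that layering (every class and method name here is made
up for illustration, and the request strings are stand-ins rather than real
Redfish or FM-API content), the point is that a top-level FM and a sub-FM
present the same interface upwards, so the orchestrator cannot tell how many
layers sit below it:

```python
from abc import ABC, abstractmethod


class FabricManager(ABC):
    """Hypothetical interface that every FM layer presents upwards."""

    @abstractmethod
    def switches(self) -> list[str]:
        """Return the IDs of the switches this FM is responsible for."""

    @abstractmethod
    def handle(self, switch_id: str, request: str) -> str:
        """Carry out a request against one switch and return a status."""


class SubFM(FabricManager):
    """FM on a single BMC, responsible only for its own switches."""

    def __init__(self, name: str, switch_ids: list[str]) -> None:
        self.name = name
        self._switch_ids = list(switch_ids)

    def switches(self) -> list[str]:
        return list(self._switch_ids)

    def handle(self, switch_id: str, request: str) -> str:
        if switch_id not in self._switch_ids:
            raise KeyError(f"{self.name} does not manage {switch_id}")
        # A real sub-FM would talk to the switch here; this just reports.
        return f"{self.name}: {request} on {switch_id}"


class TopLevelFM(FabricManager):
    """FM that owns no switches directly and only routes to sub-FMs."""

    def __init__(self, sub_fms: list[FabricManager]) -> None:
        self._route: dict[str, FabricManager] = {}
        for fm in sub_fms:
            for switch_id in fm.switches():
                self._route[switch_id] = fm

    def switches(self) -> list[str]:
        return sorted(self._route)

    def handle(self, switch_id: str, request: str) -> str:
        return self._route[switch_id].handle(switch_id, request)


fabric = TopLevelFM([SubFM("bmc0", ["sw0", "sw1"]), SubFM("bmc1", ["sw2"])])
result = fabric.handle("sw2", "bind")
```

Because TopLevelFM is itself a FabricManager, it can be nested under another
TopLevelFM, which is the "any entity need not see the whole fabric" case.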

> > 
> > The BMC based FM seems to have scalability issues, but will we hit them in
> > practice any time soon.  

Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
we definitely will in the medium term.

> 
> I had discussion recently and it looks like there are interesting points:
> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> very compute-intensive activity. So, potentially, FM on firmware side could be not
> capable to digest and executes all responsibilities without potential performance
> degradation.

There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
significant devices in their own right and run Linux or other heavyweight OSes.

> (2) However, if we have FM on host side, then there is security concerns because
> FM sees everything and all details of multiple hosts and subsystems.

Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
at least some implementations it will be running on a capable Linux machine.
In large fabrics that may be very capable indeed (basically a server dedicated to
this role).

> (3) Technically speaking, there is one potential capability that user-space FM daemon
> can run as on host side as on CXL switch side. I mean here that if we implement
> user-space FM daemon, then it could be used to execute FM functionality on CXL
> switch side (maybe????). :)

Sure, anything could run anywhere.  We should draw up some 'reference' architectures
though to guide discussion down the line.  Mind you, I think there are a lot of
steps along the way and the starting point should be a simple PoC where all the FM
stuff is in Linux userspace (other than comms).  That's easy enough to do.
If I get a quiet week or so I'll hammer out what we need on the emulation side to
start playing with this.

Jonathan

> 
> <skipped>
> 
> >>>>>   - Manage surprise removal of devices    
> >>>> 
> >>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>> what to do in the way of managing this.  Scream loudly?
> >>>>   
> >>> 
> >>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>> metadata correction and helping to manage application requests to not existing
> >>> entities.  
> >> 
> >> Notifications for the host are likely to come via inband means - so type3 driver
> >> handling rather than related to FM.  As far as the host is concerned this is the
> >> same as case where there is no FM and someone ripped a device out.
> >> 
> >> There might indeed be meta data to manage, but doubt it will have anything to
> >> do with kernel.
> >>   
> > 
> > I've also had similar thoughts, I think the OS responds to notifications that
> > are generated in-band after changes to the state of the FM are made through 
> > OOB means.
> > 
> > I envision the host sends REDFISH requests to a switch BMC that has an FM
> > implementation. Once the changes are implemented by the FM it would show up
> > as changes to the PCIe hierarchy on a host, which is capable of responding to
> > such changes.
> >   
> 
> I think I am not completely follow your point. :) First of all, I assume that if host
> sends REDFISH request, then it will be expected the confirmation of request execution.
> It means for me that host needs to receive some packet that informs that request
> executed successfully or failed. It means that some subsystem or application requested
> this change and only after receiving the confirmation requested capabilities can be used.
> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> that some FM subsystem should be on the host side to receive confirmation/notification
> and to execute the real changes in PCIe hierarchy. Am missing something here?

Another terminology issue I think.  The FM from the CXL side of things is an abstract thing
(potentially highly layered / distributed) that acts on instructions from an
orchestrator (also potentially highly distributed; in one implementation the hosts
themselves can be the orchestrator) and configures the fabric.
The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
Upstream is probably all Redfish.  What happens in between is impdef (though
obviously mapping to Redfish or FM-API as applicable may make it more
reusable and flexible).
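
To make the upstream / downstream split concrete, here is a minimal sketch of
the middle layer.  Everything in it is an assumption for illustration: the
Transport interface, the action names and the command strings are
placeholders, not Redfish actions or FM-API opcodes, and the loopback
transport just stands in for whatever link reaches the switch.

```python
from typing import Protocol


class Transport(Protocol):
    """Hypothetical downstream link to a switch."""

    def send(self, command: str, payload: dict) -> dict: ...


class LoopbackTransport:
    """Stand-in transport for a PoC: records commands, always succeeds."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.sent: list[tuple[str, dict]] = []

    def send(self, command: str, payload: dict) -> dict:
        self.sent.append((command, payload))
        return {"ok": True, "via": self.label}


class FabricManager:
    """Middle layer: takes upstream requests, issues downstream commands."""

    # Placeholder mapping; these are not Redfish actions or FM-API opcodes.
    _ACTION_TO_COMMAND = {"attach": "BIND", "detach": "UNBIND"}

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def request(self, action: str, port: int, host: str) -> dict:
        command = self._ACTION_TO_COMMAND.get(action)
        if command is None:
            return {"ok": False, "error": f"unknown action {action!r}"}
        return self._transport.send(command, {"port": port, "host": host})


link = LoopbackTransport("mctp-over-i2c")
fm = FabricManager(link)
reply = fm.request("attach", port=3, host="host1")
```

The FM only ever sees the Transport interface, so swapping the link (or
putting another FM layer behind it) does not touch the upstream side.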

I think some diagrams of what is where will help.
I think we need the following (note I've always kept the controlling hosts as normal
hosts as well, since that includes the case where a controller never uses the fabric
itself, so BMC type cases are a subset and we don't need to double the number of diagrams).

1) Diagram of a single host with the FM as one 'thing' on that host - direct interfaces
   to a single switch - interface options include switch CCI MB, MCTP over PCI VDM,
   MCTP over say I2C.

2) Diagram of the same as above, with a multi-headed device (MHD) all connected to one host.

3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
   one of which is responsible for fabric management (FM and orchestrator both in
   that manager host) - agents on other hosts able to send requests for services to that host.

4) Diagram of 3, but now with multiple switches, each with separate controlling host.
   Some other hosts that don't have any fabric control.
   Distributed FM across the controlling hosts.

5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
   orchestrator, that then talks to the FM.

6) 4, but push some management entities down into switches (from an architecture point of
   view this is no different from the layered case with a separate BMC per switch - there
   is still either a distributed FM or a layered FM, which the orchestrator talks to).

We can vary exactly how responsibilities are distributed across the various layers.

I can sketch this lot up (and that will probably make some gaps in these cases apparent)
but will take a little while, hence text descriptions in the meantime.

I come back to my personal view though - which is don't worry too much at this early
stage, beyond making sure we have some layering in code so that we can distribute
it across a distributed or layered architecture later!   

Jonathan

> 
> Thanks,
> Slava.
> 
>