Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Mon, 20 Feb 2023 11:59:26 +0000

On Fri, 17 Feb 2023 10:31:15 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:

> > On Feb 10, 2023, at 4:32 AM, Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > 
> > On Thu, 9 Feb 2023 14:04:13 -0800
> > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >   
> >>> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> >>> 
> >>> On Wed, 8 Feb 2023 10:03:57 -0800
> >>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >>>   
> >>>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@xxxxxxxxxxx> wrote:
> >>>>> 
> >>>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:      
> >>>>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >>>>>>   
> >>>>>>>>   
> >>>> 
> >>>> <skipped>
> >>>>   
> >>>>>>> 
> >>>>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>>>> Yes, FM on host could be important for debug and to verify correctness of
> >>>>>>> firmware-based implementations. But FM daemon on host could be important
> >>>>>>> to receive notifications and react somehow to these events. Also, journalling
> >>>>>>> of events/messages could be an important responsibility of FM daemon
> >>>>>>> on host.       
> >>>>>> 
> >>>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>>>> that also has the lower level FM-API access).  I think it is somewhat
> >>>>>> separate from the rest of this on basis it may well just be talking redfish
> >>>>>> to the FM and there are lots of tools for that sort of handling already.
> >>>>>>   
> >>>>> 
> >>>>> I would be interested in participating in a BOF about this topic. I wonder what
> >>>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
> >>>>> In this case, does it make more sense to have an owner of the global FM state 
> >>>>> be a user space application. Is this the job of the orchestrator?    
> >>> 
> >>> This partly comes down to terminology. Ultimately there is an FM that is
> >>> responsible for the whole fabric (could be distributed software) and that
> >>> in turn will talk to the various BMCs that then talk to the switches.
> >>> 
> >>> Depending on the setup it may not be necessary for any entity to see the
> >>> whole fabric.
> >>> 
> >>> Interesting point in general though.  I think it boils down to getting
> >>> layering in any software correct and that is easier done from outset.
> >>> 
> >>> I don't know whether the redfish stuff is flexible enough to cover this, but
> >>> if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
> >>> and in turn presenting redfish to the orchestrator.
> >>> 
> >>> Any of these components might run on separate machines, or in firmware on
> >>> some device, or indeed all run on one server that is acting as the FM and
> >>> a node in the orchestrator layer.
> >>>   
> >>>>> 
> >>>>> The BMC based FM seems to have scalability issues, but will we hit them in
> >>>>> practice any time soon.      
> >>> 
> >>> Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
> >>> we definitely will in the medium term.
> >>>   
> >>>> 
> >>>> I had discussion recently and it looks like there are interesting points:
> >>>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> >>>> very compute-intensive activity. So, potentially, FM on firmware side might not be
> >>>> capable of digesting and executing all responsibilities without potential performance
> >>>> degradation.    
> >>> 
> >>> There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
> >>> significant devices in their own right and run Linux or other heavy weight OSes.
> >>>   
> >>>> (2) However, if we have FM on host side, then there is security concerns because
> >>>> FM sees everything and all details of multiple hosts and subsystems.    
> >>> 
> >>> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> >>> at least some implementations it will be running on a capable Linux machine.
> >>> In large fabrics that may be very capable indeed (basically a server dedicated to
> >>> this role).
> >>>   
> >>>> (3) Technically speaking, there is one potential capability that user-space FM daemon
> >>>> can run on the host side as well as on the CXL switch side. I mean here that if we implement
> >>>> user-space FM daemon, then it could be used to execute FM functionality on CXL
> >>>> switch side (maybe????). :)    
> >>> 
> >>> Sure, anything could run anywhere.  We should draw up some 'reference' architectures
> >>> though to guide discussion down the line.  Mind you I think there are a lot of
> >>> steps along the way and starting point should be a simple PoC where all the FM
> >>> stuff is in linux userspace (other than comms).  That's easy enough to do.
> >>> If I get a quiet week or so I'll hammer out what we need on emulation side to
> >>> start playing with this.
> >>> 
> >>> Jonathan
> >>> 
> >>> 
> >>>   
> >>>> 
> >>>> <skipped>
> >>>>   
> >>>>>>>>> - Manage surprise removal of devices        
> >>>>>>>> 
> >>>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>>>> what to do in the way of managing this.  Scream loudly?
> >>>>>>>>   
> >>>>>>> 
> >>>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>>>>>> metadata correction and helping to manage application requests to not existing
> >>>>>>> entities.      
> >>>>>> 
> >>>>>> Notifications for the host are likely to come via inband means - so type3 driver
> >>>>>> handling rather than related to FM.  As far as the host is concerned this is the
> >>>>>> same as case where there is no FM and someone ripped a device out.
> >>>>>> 
> >>>>>> There might indeed be meta data to manage, but doubt it will have anything to
> >>>>>> do with kernel.
> >>>>>>   
> >>>>> 
> >>>>> I've also had similar thoughts, I think the OS responds to notifications that
> >>>>> are generated in-band after changes to the state of the FM are made through 
> >>>>> OOB means.
> >>>>> 
> >>>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
> >>>>> implementation. Once the changes are implemented by the FM it would show up
> >>>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>>>> such changes.
> >>>>>   
> >>>> 
> >>>> I think I do not completely follow your point. :) First of all, I assume that if host
> >>>> sends REDFISH request, then it will be expected the confirmation of request execution.
> >>>> It means for me that host needs to receive some packet that informs that request
> >>>> executed successfully or failed. It means that some subsystem or application requested
> >>>> this change and only after receiving the confirmation requested capabilities can be used.
> >>>> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> >>>> that some FM subsystem should be on the host side to receive confirmation/notification
> >>>> and to execute the real changes in PCIe hierarchy. Am I missing something here?
> >>> 
> >>> Another terminology issue I think.  FM from CXL side of things is an abstract thing
> >>> (potentially highly layered / distributed) that acts on instructions from an
> >>> orchestrator (also potentially highly distributed, one implementation is hosts
> >>> can be the orchestrator) and configures the fabric.
> >>> The downstream APIs to the switches and EPs are all in FM-API (CXL spec)
> >>> Upstream probably all Redfish.  What happens in between is impdef (though
> >>> obviously mapping to Redfish or FM-API as applicable may make it more
> >>> reusable and flexible).
> >>> 
> >>> I think some diagrams of what is where will help.
> >>> I think we need (note I've always kept the controller hosts as normal hosts as well
> >>> as that includes the case where it never uses the Fabric - so BMC type cases as
> >>> a subset without needing to double the number of diagrams).
> >>> 
> >>> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
> >>>  to a single switch - interface options include switch CCI MB, MCTP over PCI VDM,
> >>>  MCTP over say I2C.
> >>> 
> >>> 2) Diagram of same as above, with a multiple head device all connected to one host.
> >>> 
> >>> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
> >>>  one of which is responsible for fabric management (FM in that manager host
> >>>  and orchestrator) - agents on other hosts able to send requests for services to that host.
> >>> 
> >>> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
> >>>  Some other hosts that don't have any fabric control.
> >>>  Distributed FM across the controlling hosts.
> >>> 
> >>> 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
> >>>  orchestrator, that then talks to the FM.
> >>> 
> >>> 6) 4, but push some management entities down into switches (from architecture point of
> >>>  view this is no different from layered case with a separate BMC per switch - there
> >>>  is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> >>> 
> >>> Can mess with the exact distribution of who does what across the various layers.
> >>> 
> >>> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> >>> but will take a little while, hence text descriptions in the meantime.
> >>> 
> >>> I come back to my personal view though - which is don't worry too much at this early
> >>> stage, beyond making sure we have some layering in code so that we can distribute
> >>> it across a distributed or layered architecture later!   
> >>>   
> >> 
> >> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
> >> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
> >> Any suggestion?  
> > 
> > Ascii art :)  To have a broad discussion it needs to be mailing list and that
> > is effectively only option.
> >   
> 
> I tried to prepare some diagram based on ascii art. :) It looks pretty terrible in email:
> 
> ----------------------------         ------------------
> |  ---------       ------  |         |                |
> |  | Agent | <---> | FM |  |         |                |
> |  ---------       ------  |<------->|   CXL switch   |
> |            Host          |         |                |
> |                          |         |                |
> ----------------------------         —————————
other than wrong line type on the right looks fine to me ;)

> 
> I think we need to use some online resource, anyway. We are discussing with Adam what we
> can do here.
> 
> You introduced Orchestrator entity. I realized that I do not completely follow the responsibility
> of this subsystem. Do you imply some central point of management of multiple FM instances?

Absolutely - whether its role is actually separate from the FM or not is an implementation
detail, but the assumption is someone is placing the VMs etc that are using the CXL memory and
only that entity will have the knowledge of what memory to assign to which host to provide
that memory to the VMs.

> Something like a router that has knowledge base and can redirect the request to proper FM
> instance. Am I correct?

More than that.  The orchestrator would get a 'give me a VM with X normal DRAM and Y CXL DRAM'
request, figure out where to put that VM across a set of systems, and issue the commands
to the relevant FMs to 'make it so'.  So that's the entity that would query all the FMs
to understand what resources it is managing and then tell them what to do (possibly
via multiple layers of abstraction and sub-orchestrators etc).
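
To make the query-then-command flow a bit more concrete, here is a minimal
Python sketch.  Everything in it is hypothetical: the FabricManager interface,
the class names and the MiB bookkeeping are made up for illustration and are
not Redfish or the CXL FM-API.  It only shows the layering: a leaf FM per
switch, an optional layered FM presenting the same interface upward, and an
orchestrator that queries the FMs and then tells one of them what to do.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, Protocol


class FabricManager(Protocol):
    """Hypothetical FM interface (an assumption, not Redfish or FM-API)."""

    def free_cxl_mib(self) -> int: ...
    def assign(self, host: str, mib: int) -> bool: ...


@dataclass
class SwitchFM:
    """Leaf FM controlling one switch with a fixed pool of CXL memory."""
    name: str
    free_mib: int
    assignments: dict[str, int] = field(default_factory=dict)

    def free_cxl_mib(self) -> int:
        return self.free_mib

    def assign(self, host: str, mib: int) -> bool:
        if mib > self.free_mib:
            return False  # refuse without changing any state
        self.free_mib -= mib
        self.assignments[host] = self.assignments.get(host, 0) + mib
        return True


@dataclass
class LayeredFM:
    """FM that fronts several sub-FMs and looks like a single FM from above."""
    children: list[FabricManager]

    def free_cxl_mib(self) -> int:
        # Largest request that a single child could satisfy.
        return max((c.free_cxl_mib() for c in self.children), default=0)

    def assign(self, host: str, mib: int) -> bool:
        # Try children in order; stops at the first one that accepts.
        return any(c.assign(host, mib) for c in self.children)


@dataclass
class Orchestrator:
    """Knows which FM serves which host; decides where a VM's memory goes."""
    fm_for_host: dict[str, FabricManager]

    def place_vm(self, cxl_mib: int) -> Optional[str]:
        # Query every FM, then try hosts in order of most free CXL memory.
        ranked = sorted(self.fm_for_host.items(),
                        key=lambda kv: kv[1].free_cxl_mib(), reverse=True)
        for host, fm in ranked:
            if fm.assign(host, cxl_mib):
                return host
        return None


sw_a = SwitchFM("sw-a", free_mib=4096)
sw_b = SwitchFM("sw-b", free_mib=16384)
top = LayeredFM([sw_a, sw_b])
orch = Orchestrator({"host0": sw_a, "host1": top})
chosen = orch.place_vm(8192)  # only sw-b, reached via the layered FM, can fit this
```

The point of giving LayeredFM the same interface as SwitchFM is that the
orchestrator does not need to know how many layers sit below it.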

> It sounds to me that orchestrator needs to implement some
> sub-API of FM. Or, maybe, it needs to parse REDFISH packets, for example, and only
> redirects the packets.

I'd expect individual hosts to mostly do what they are told to do, or maybe
ask nicely for more resources for a particular VM or application.  The hosts shouldn't
be responsible for allocating those resources, but should just be told where they
are.  That stuff might be in redfish or similar, but it's way above the level of
anything CXL specific.
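
As a rough illustration of that split, the Python sketch below has a host-side
agent that can only ask, and an allocator (standing in for the orchestrator/FM
side) that decides and tells the host where the memory is.  The Grant fields,
the pool name and the class names are all assumptions invented for the example;
a real deployment might carry this over Redfish or something similar.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """What a host is told: where its extra memory is (fields hypothetical)."""
    host: str
    pool: str
    offset_mib: int
    size_mib: int


class Allocator:
    """Stand-in for the orchestrator/FM side; the only place decisions are made."""

    def __init__(self, pool: str, size_mib: int) -> None:
        self.pool = pool
        self.size_mib = size_mib
        self.used_mib = 0

    def request(self, host: str, size_mib: int) -> Optional[Grant]:
        if size_mib <= 0 or self.used_mib + size_mib > self.size_mib:
            return None  # request refused, nothing allocated
        grant = Grant(host, self.pool, self.used_mib, size_mib)
        self.used_mib += size_mib
        return grant


@dataclass
class HostAgent:
    """Host-side agent: asks nicely, never allocates anything itself."""
    host: str
    allocator: Allocator
    grants: list[Grant] = field(default_factory=list)

    def ask_for(self, size_mib: int) -> bool:
        grant = self.allocator.request(self.host, size_mib)
        if grant is None:
            return False
        self.grants.append(grant)  # record where we were told the memory is
        return True


alloc = Allocator("pool0", size_mib=1024)
agent = HostAgent("host0", alloc)
first = agent.ask_for(512)    # fits in the pool
second = agent.ask_for(1024)  # only 512 MiB left, so this is refused
```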

Jonathan

> 
> Thanks,
> Slava.
>