Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture

"Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> · Thu, 9 Feb 2023 14:04:13 -0800

> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> 
> On Wed, 8 Feb 2023 10:03:57 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> 
>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@xxxxxxxxxxx> wrote:
>>> 
>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:  
>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
>>>> 
>>>>>> 
>> 
>> <skipped>
>> 
>>>>> 
>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>> Yes, FM on host could be important for debug and to verify the correctness
>>>>> of firmware-based implementations. But an FM daemon on host could also be
>>>>> important to receive notifications and react to these events. Also, journalling
>>>>> of events/messages could be an important responsibility of the FM daemon
>>>>> on host.
>>>> 
>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>> that also has the lower level FM-API access).  I think it is somewhat
>>>> separate from the rest of this on basis it may well just be talking redfish
>>>> to the FM and there are lots of tools for that sort of handling already.
>>>> 
>>> 
>>> I would be interested in participating in a BOF about this topic. I wonder what
>>> happens when we have multiple switches with multiple FMs, each on a separate BMC.
>>> In this case, does it make more sense to have the owner of the global FM state
>>> be a user space application? Is this the job of the orchestrator?
> 
> This partly comes down to terminology. Ultimately there is an FM that is
> responsible for the whole fabric (could be distributed software) and that
> in turn will talk to the various BMCs that then talk to the switches.
> 
> Depending on the setup it may not be necessary for any entity to see the
> whole fabric.
> 
> Interesting point in general though.  I think it boils down to getting
> layering in any software correct and that is easier done from the outset.
> 
> I don't know whether the redfish stuff is flexible enough to cover this, but
> if it is, I'd envision the actual FM talking redfish to a bunch of sub-FMs
> and in turn presenting redfish to the orchestrator.
> 
> Any of these components might run on separate machines, or in firmware on
> some device, or indeed all run on one server that is acting as the FM and
> a node in the orchestrator layer.
> 
>>> 
>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>> practice any time soon?
> 
> Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
> we definitely will in the medium term.
> 
>> 
>> I had a discussion recently and it looks like there are some interesting points:
>> (1) If we have multiple CXL switches (especially with a complex hierarchy), then
>> managing them is a very compute-intensive activity. So, potentially, an FM on the
>> firmware side might not be capable of digesting and executing all of its
>> responsibilities without potential performance degradation.
> 
> There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
> significant devices in their own right and run Linux or other heavy weight OSes.
> 
>> (2) However, if we have FM on host side, then there are security concerns because
>> FM sees everything and all details of multiple hosts and subsystems.
> 
> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> at least some implementations it will be running on a capable Linux machine.
> In large fabrics that may be very capable indeed (basically a server dedicated to
> this role).
> 
>> (3) Technically speaking, there is one potential capability: a user-space FM daemon
>> could run on the host side as well as on the CXL switch side. I mean that if we
>> implement a user-space FM daemon, then it could also be used to execute FM
>> functionality on the CXL switch side (maybe????). :)
> 
> Sure, anything could run anywhere.  We should draw up some 'reference' architectures
> though to guide discussion down the line.  Mind you I think there are a lot of
> steps along the way and the starting point should be a simple PoC where all the FM
> stuff is in Linux userspace (other than comms).  That's easy enough to do.
> If I get a quiet week or so I'll hammer out what we need on emulation side to
> start playing with this.
> 
> Jonathan
> 
> 
> 
>> 
>> <skipped>
>> 
>>>>>>>  - Manage surprise removal of devices    
>>>>>> 
>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>> what to do in the way of managing this.  Scream loudly?
>>>>>> 
>>>>> 
>>>>> Maybe, it could require notifying application(s). Let’s imagine that an application
>>>>> uses some resources from the removed device. Maybe, FM can manage kernel-space
>>>>> metadata correction and help to manage application requests to non-existent
>>>>> entities.
>>>> 
>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>> handling rather than related to FM.  As far as the host is concerned this is the
>>>> same as the case where there is no FM and someone ripped a device out.
>>>> 
>>>> There might indeed be metadata to manage, but I doubt it will have anything to
>>>> do with the kernel.
>>>> 
>>> 
>>> I've also had similar thoughts, I think the OS responds to notifications that
>>> are generated in-band after changes to the state of the FM are made through 
>>> OOB means.
>>> 
>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
>>> implementation. Once the changes are implemented by the FM it would show up
>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>> such changes.
>>> 
>> 
>> I think I don't completely follow your point. :) First of all, I assume that if the host
>> sends a REDFISH request, then it will expect confirmation that the request was executed.
>> For me, it means that the host needs to receive some packet that reports whether the request
>> executed successfully or failed. It also means that some subsystem or application requested
>> this change, and only after receiving the confirmation can the requested capabilities be used.
>> And if FM is on the CXL switch side, then how will FM show up the changes? It sounds to me
>> like some FM subsystem should be on the host side to receive the confirmation/notification
>> and to execute the real changes in the PCIe hierarchy. Am I missing something here?
> 
> Another terminology issue I think.  FM from CXL side of things is an abstract thing
> (potentially highly layered / distributed) that acts on instructions from an
> orchestrator (also potentially highly distributed; in one implementation the hosts
> can be the orchestrator) and configures the fabric.
> The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
> Upstream is probably all Redfish.  What happens in between is impdef (though
> obviously mapping to Redfish or FM-API as applicable may make it more
> reusable and flexible).
> 
> I think some diagrams of what is where will help.
> I think we need the following (note I've always kept the controller hosts as normal hosts as well
> as that includes the case where it never uses the Fabric - so BMC type cases as
> a subset without needing to double the number of diagrams).
> 
> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
>   to a single switch - interface options include switch CCI MB, mctp over PCI VDM,
>   mctp over say i2c.
> 
> 2) Diagram of same as above, with a multi-headed device all connected to one host.
> 
> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
>   one of which is responsible for fabric management (FM in that manager host
>   and orchestrator) - agents on other hosts able to send requests for services to that host.
> 
> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
>   Some other hosts that don't have any fabric control.
>   Distributed FM across the controlling hosts.
> 
> 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
>   orchestrator, that then talks to the FM.
> 
> 6) 4, but push some management entities down into switches (from architecture point of
>   view this is no different from layered case with a separate BMC per switch - there
>   is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> 
> Can mess with the exact distribution of who does what across the various layers.
> 
> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> but will take a little while, hence text descriptions in the meantime.
> 
> I come back to my personal view though - which is don't worry too much at this early
> stage, beyond making sure we have some layering in code so that we can distribute
> it across a distributed or layered architecture later!   
> 
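
For illustration only, the "layering in code" point quoted above could be sketched
roughly as below. This is a minimal sketch under assumptions, not an existing API:
the names FabricManager, SwitchFM, LayeredFM, devices() and bind() are all
hypothetical, and a plain dict stands in for whatever FM-API transport a real leaf
FM would use. The idea is only that every layer exposes the same interface, so an
upper FM can sit on top of sub-FMs without the caller caring where each one runs.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Optional


class FabricManager(ABC):
    """Hypothetical interface shared by every FM layer (not a real API)."""

    @abstractmethod
    def devices(self) -> Dict[str, Optional[str]]:
        """Map each device id to the host it is bound to (None if unbound)."""

    @abstractmethod
    def bind(self, device: str, host: str) -> bool:
        """Bind a device to a host; return True on success."""


class SwitchFM(FabricManager):
    """Leaf FM for one switch; a dict stands in for the FM-API transport."""

    def __init__(self, device_ids: Iterable[str]) -> None:
        self._bindings: Dict[str, Optional[str]] = {d: None for d in device_ids}

    def devices(self) -> Dict[str, Optional[str]]:
        return dict(self._bindings)

    def bind(self, device: str, host: str) -> bool:
        # Refuse unknown devices and devices that are already bound.
        if device not in self._bindings or self._bindings[device] is not None:
            return False
        self._bindings[device] = host
        return True


class LayeredFM(FabricManager):
    """Upper FM presenting the same interface over several sub-FMs."""

    def __init__(self, sub_fms: Iterable[FabricManager]) -> None:
        self._sub_fms = list(sub_fms)

    def devices(self) -> Dict[str, Optional[str]]:
        merged: Dict[str, Optional[str]] = {}
        for fm in self._sub_fms:
            merged.update(fm.devices())
        return merged

    def bind(self, device: str, host: str) -> bool:
        # Route the request to whichever sub-FM owns the device.
        for fm in self._sub_fms:
            if device in fm.devices():
                return fm.bind(device, host)
        return False


# An orchestrator would only ever see the top-level FabricManager.
fabric = LayeredFM([SwitchFM(["mem0", "mem1"]), SwitchFM(["mem2"])])
```

Because LayeredFM is itself a FabricManager, the same code would cover the single
switch case (one SwitchFM used directly) and the layered case (LayeredFM over
several sub-FMs), which is the kind of flexibility the diagrams would need to capture.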

I had a slightly simpler picture in my mind. :) We definitely need to have diagrams
to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
Any suggestions?

Thanks,
Slava.