On Thu, 9 Feb 2023 14:04:13 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> > On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> >
> > On Wed, 8 Feb 2023 10:03:57 -0800
> > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >
> >>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@xxxxxxxxxxx> wrote:
> >>>
> >>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> >>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>>>>
> >>
> >> <skipped>
> >>
> >>>>>
> >>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>> Yes, FM on host could be important for debug and to verify correctness of
> >>>>> firmware-based implementations. But FM daemon on host could be important
> >>>>> to receive notifications and react somehow on these events. Also, journalling
> >>>>> of events/messages could be important responsibility of FM daemon
> >>>>> on host.
> >>>>
> >>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>> that also has the lower level FM-API access). I think it is somewhat
> >>>> separate from the rest of this on basis it may well just be talking Redfish
> >>>> to the FM and there are lots of tools for that sort of handling already.
> >>>>
> >>>
> >>> I would be interested in participating in a BOF about this topic. I wonder what
> >>> happens when we have multiple switches with multiple FMs each on a separate BMC.
> >>> In this case, does it make more sense to have an owner of the global FM state
> >>> be a user space application? Is this the job of the orchestrator?
> >
> > This partly comes down to terminology. Ultimately there is an FM that is
> > responsible for the whole fabric (could be distributed software) and that
> > in turn will talk to the various BMCs that then talk to the switches.
> >
> > Depending on the setup it may not be necessary for any entity to see the
> > whole fabric.
> >
> > Interesting point in general though. I think it boils down to getting
> > layering in any software correct and that is easier done from outset.
> >
> > I don't know whether the Redfish stuff is flexible enough to cover this, but
> > if it is, I'd envision, the actual FM talking Redfish to a bunch of sub-FMs
> > and in turn presenting Redfish to the orchestrator.
> >
> > Any of these components might run on separate machines, or in firmware on
> > some device, or indeed all run on one server that is acting as the FM and
> > a node in the orchestrator layer.
> >
> >>>
> >>> The BMC based FM seems to have scalability issues, but will we hit them in
> >>> practice any time soon?
> >
> > Who knows ;) If anyone builds the large scale fabric stuff in CXL 3.0 then
> > we definitely will in the medium term.
> >
> >>
> >> I had discussion recently and it looks like there are interesting points:
> >> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> >> very compute-intensive activity. So, potentially, FM on firmware side could be not
> >> capable to digest and execute all responsibilities without potential performance
> >> degradation.
> >
> > There is firmware and there is firmware ;) It's not uncommon for BMCs to be
> > significant devices in their own right and run Linux or other heavy weight OSes.
> >
> >> (2) However, if we have FM on host side, then there are security concerns because
> >> FM sees everything and all details of multiple hosts and subsystems.
> >
> > Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> > at least some implementations it will be running on a capable Linux machine.
> > In large fabrics that may be very capable indeed (basically a server dedicated to
> > this role).
> >
> >> (3) Technically speaking, there is one potential capability: a user-space FM daemon
> >> can run on the host side as well as on the CXL switch side. I mean here that if we
> >> implement user-space FM daemon, then it could be used to execute FM functionality on CXL
> >> switch side (maybe????). :)
> >
> > Sure, anything could run anywhere. We should draw up some 'reference' architectures
> > though to guide discussion down the line. Mind you I think there are a lot of
> > steps along the way and starting point should be a simple PoC where all the FM
> > stuff is in Linux userspace (other than comms). That's easy enough to do.
> > If I get a quiet week or so I'll hammer out what we need on emulation side to
> > start playing with this.
> >
> > Jonathan
> >
> >
> >
> >>
> >> <skipped>
> >>
> >>>>>>> - Manage surprise removal of devices
> >>>>>>
> >>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>> what to do in the way of managing this. Scream loudly?
> >>>>>>
> >>>>>
> >>>>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>>>> metadata correction and help to manage application requests to non-existent
> >>>>> entities.
> >>>>
> >>>> Notifications for the host are likely to come via inband means - so type3 driver
> >>>> handling rather than related to FM. As far as the host is concerned this is the
> >>>> same as case where there is no FM and someone ripped a device out.
> >>>>
> >>>> There might indeed be meta data to manage, but doubt it will have anything to
> >>>> do with kernel.
> >>>>
> >>>
> >>> I've also had similar thoughts, I think the OS responds to notifications that
> >>> are generated in-band after changes to the state of the FM are made through
> >>> OOB means.
> >>>
> >>> I envision the host sends REDFISH requests to a switch BMC that has an FM
> >>> implementation.
> >>> Once the changes are implemented by the FM it would show up
> >>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>> such changes.
> >>>
> >>
> >> I think I do not completely follow your point. :) First of all, I assume that if host
> >> sends REDFISH request, then confirmation of the request execution will be expected.
> >> To me it means that host needs to receive some packet that informs that request
> >> executed successfully or failed. It means that some subsystem or application requested
> >> this change and only after receiving the confirmation requested capabilities can be used.
> >> And if FM is on CXL switch side, then how will FM show the changes? It sounds to me
> >> that some FM subsystem should be on the host side to receive confirmation/notification
> >> and to execute the real changes in PCIe hierarchy. Am I missing something here?
> >
> > Another terminology issue I think. FM from CXL side of things is an abstract thing
> > (potentially highly layered / distributed) that acts on instructions from an
> > orchestrator (also potentially highly distributed, one implementation is hosts
> > can be the orchestrator) and configures the fabric.
> > The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
> > Upstream probably all Redfish. What happens in between is impdef (though
> > obviously mapping to Redfish or FM-API as applicable may make it more
> > reusable and flexible).
> >
> > I think some diagrams of what is where will help.
> > I think we need (note I've always kept the controller hosts as normal hosts as well
> > as that includes the case where it never uses the Fabric - so BMC type cases as
> > a subset without needing to double the number of diagrams).
> >
> > 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
> > to a single switch - interface options include switch CCI MB, mctp over PCI VDM,
> > mctp over say i2c.
> >
> > 2) Diagram of same as above, with a multi-headed device all connected to one host.
> >
> > 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
> > one of which is responsible for fabric management. (FM in that manager host
> > and orchestrator) - agents on other hosts able to send requests for services to that host.
> >
> > 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
> > Some other hosts that don't have any fabric control.
> > Distributed FM across the controlling hosts.
> >
> > 5) Diagram of 4 but with layered FM and separate Orchestrator. Hosts all talk to the
> > orchestrator, that then talks to the FM.
> >
> > 6) 4, but push some management entities down into switches (from architecture point of
> > view this is no different from layered case with a separate BMC per switch - there
> > is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> >
> > Can mess with the exact distribution of who does what across the various layers.
> >
> > I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> > but will take a little while, hence text descriptions in the meantime.
> >
> > I come back to my personal view though - which is don't worry too much at this early
> > stage, beyond making sure we have some layering in code so that we can distribute
> > it across a distributed or layered architecture later!
> >
>
> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
> Any suggestion?

ASCII art :) To have a broad discussion it needs to be on the mailing list,
and that is effectively the only option.

>
> Thanks,
> Slava.
>
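
To make the "layering in code" point from the thread a bit more concrete, here is a minimal Python sketch of one way the layers could be kept separable. Everything in it is hypothetical: the class names (FabricManager, SwitchFM, LayeredFM, Orchestrator), the BindRequest fields and the bind() call are illustrative only and are not taken from the CXL FM-API or from Redfish. The leaf FM stands in for whatever actually issues FM-API commands to a switch, and the boolean return value stands in for the confirmation that travels back upstream.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class BindRequest:
    """Hypothetical request: give `host` the device behind `port` on a switch."""
    switch_id: str
    port: int
    host: str


class FabricManager(ABC):
    """Common upstream interface, so an FM can be a leaf or a layer of sub-FMs."""

    @abstractmethod
    def switches(self) -> set[str]:
        """Switch ids reachable through this FM."""

    @abstractmethod
    def bind(self, req: BindRequest) -> bool:
        """Apply the request; the result is the confirmation sent upstream."""


class SwitchFM(FabricManager):
    """Leaf FM for one switch (assumption: a real one would speak FM-API here)."""

    def __init__(self, switch_id: str, num_ports: int) -> None:
        self.switch_id = switch_id
        self.num_ports = num_ports
        self.bindings: dict[int, str] = {}  # port -> host

    def switches(self) -> set[str]:
        return {self.switch_id}

    def bind(self, req: BindRequest) -> bool:
        if req.switch_id != self.switch_id:
            return False
        # Reject ports that do not exist or are already bound.
        if not 0 <= req.port < self.num_ports or req.port in self.bindings:
            return False
        self.bindings[req.port] = req.host
        return True


class LayeredFM(FabricManager):
    """FM that owns no switch itself and forwards to whichever sub-FM does."""

    def __init__(self, children: list[FabricManager]) -> None:
        self.children = children

    def switches(self) -> set[str]:
        found: set[str] = set()
        for child in self.children:
            found |= child.switches()
        return found

    def bind(self, req: BindRequest) -> bool:
        for child in self.children:
            if req.switch_id in child.switches():
                return child.bind(req)
        return False  # no sub-FM manages that switch


class Orchestrator:
    """Takes host requests and talks only to the top-level FM."""

    def __init__(self, fm: FabricManager) -> None:
        self.fm = fm

    def request(self, host: str, switch_id: str, port: int) -> bool:
        return self.fm.bind(BindRequest(switch_id, port, host))
```

Because the orchestrator only sees the FabricManager interface, the same code covers a single FM on one host (case 1) and a layered FM across several controlling hosts or BMCs (cases 4 to 6); in a real deployment the method calls would presumably be replaced by Redfish or FM-API transport between machines.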