Hi Reinette,

On Wed, Aug 14, 2024 at 10:37 AM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> Hi Peter,
>
> On 8/2/24 3:50 PM, Peter Newman wrote:
> > On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >> On 8/2/24 11:49 AM, Peter Newman wrote:
> >>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> >>>> I am of course not familiar with details of the software implementation
> >>>> - could there be benefits to using it even if hardware counters are
> >>>> supported?
> >>>
> >>> I can't see any situation where the user would want to choose software
> >>> over hardware counters. The number of groups which can be monitored by
> >>> software assignable counters will always be less than with hardware,
> >>> due to the need for consuming one RMID (and the counters automatically
> >>> allocated to it by the AMD hardware) for all unassigned groups.
> >>
> >> Thank you for clarifying. This seems specific to this software implementation,
> >> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
> >> correctly this depends on undocumented hardware specific knowledge.
> >
> > For the benefit of anyone else who needs to monitor bandwidth on a
> > large number of monitoring groups on pre-ABMC AMD implementations,
> > hopefully a future AMD publication will clarify, at least on some
> > existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
> >
> >
> >>> The behavior as I've implemented today is:
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> >>> 0
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=tl;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>>
> >>
> >> This highlights how there cannot be a generic/consistent interface between hardware
> >> and software implementation. If resctrl implements something like above without any
> >> other hints to user space then it will push complexity to user space since user space
> >> would not know if setting one flag results in setting more than that flag, which may
> >> force a user space implementation to always follow a write with a read that
> >> needs to confirm what actually resulted from the write. Similarly, that removing a
> >> flag impacts other flags needs to be clear without user space needing to "try and
> >> see what happens".
> >
> > I'll return to this topic in the context of MPAM below...
> >
> >> It is not clear to me how to interpret the above example when it comes to the
> >> RMID management though. If the RMID assignment is per group then I expected all
> >> the domains of a group to have the same flag(s)?
> >
> > The group RMIDs are never programmed into any MSRs and the RMID space
> > is independent in each domain, so it is still possible to do
> > per-domain assignment.
> > (and like with soft RMIDs, this enables us to
> > create unlimited groups, but we've never been limited by the size of
> > the RMID space)
> >
> > However, in our use cases, jobs are not confined to any domain, so
> > bandwidth measurements must be done simultaneously in all domains, so
> > we have no current use for per-domain assignment. But if any Google
> > users did begin to see value in confining jobs to domains, this could
> > change.
> >
> >>
> >>>>
> >>>>> However, If we don't expect to see these semantics in any other
> >>>>> implementation, these semantics could be implicit in the definition of
> >>>>> a SW assignable counter.
> >>>>
> >>>> It is not clear to me how implementation differences between hardware
> >>>> and software assignment can be hidden from user space. It is possible
> >>>> to let user space enable individual events and then silently upgrade it
> >>>> to all events. I see two options here, either "mbm_control" needs to
> >>>> explicitly show this "silent upgrade" so that user space knows which
> >>>> events are actually enabled, or "mbm_control" only shows flags/events enabled
> >>>> from user space perspective. In the former scenario, this needs more
> >>>> user space support since a generic user space cannot be confident which
> >>>> flags are set after writing to "mbm_control". In the latter scenario,
> >>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> >>>> to rely on it to know which events can be enabled and if some are
> >>>> actually "silently enabled" when user space still thinks it needs to be
> >>>> enabled the number of available counters becomes vague.
> >>>>
> >>>> It is not clear to me how to present hardware and software assignable
> >>>> counters with a single consistent interface. Actually, what if the
> >>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
> >>>> it is backed (hw vs sw)?
> >>>> What if, instead of "mbm_cntr_assignable" and
> >>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> >>>> and "mbm_cntr_group_assignable" is used? Could that replace a
> >>>> potential "mbm_assign_events" while also supporting user space in
> >>>> interactions with "mbm_control"?
> >>>
> >>> If I understand this correctly, is this a preference that the info
> >>> node be named differently if its value will have different units,
> >>> rather than a second node to indicate what the value of num_mbm_cntrs
> >>> actually means? This sounds reasonable to me.
> >>
> >> Indeed. As you highlighted, user space may not need to know if
> >> counters are backed by hardware or software, but user space needs to
> >> know what to expect from (how to interact with) interface.
> >>
> >>> I think it's also important to note that in MPAM, the MBWU (memory
> >>> bandwidth usage) monitors don't have a concept of local versus total
> >>> bandwidth, so event assignment would likely not apply there either.
> >>> What the counted bandwidth actually represents is more implicit in the
> >>> monitor's position in the memory system in the particular
> >>> implementation. On a theoretical multi-socket system, resctrl would
> >>> require knowledge about the system's architecture to stitch together
> >>> the counts from different types of monitors to produce a local and
> >>> total value. I don't know if we'd program this SoC-specific knowledge
> >>> into the kernel to produce a unified MBM resource like we're
> >>> accustomed to now or if we'd present multiple MBM resources, each only
> >>> providing an mbm_total_bytes event. In this case, the counters would
> >>> have to be assigned separately in each MBM resource, especially if the
> >>> different MBM resources support a different number of counters.
> >>>
> >>
> >> "total" and "local" bandwidth is already in grey area after the
> >> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> >> user space could set values reported to not be constrained by the
> >> "total" and "local" terms. We keep sticking with it though, even in
> >> this implementation that uses the "t" and "l" flags, knowing that
> >> what is actually monitored when "l" is set is just what the user
> >> configured via mbm_local_bytes_config, which theoretically
> >> can be "total" bandwidth.
> >
> > If it makes sense to support a separate, group-assignment interface at
> > least for MPAM, this would be a better fit for soft-ABMC, even if it
> > does have to stay downstream.
>
> (apologies for the delay)
>
> Could we please take a step back and confirm/agree what is meant with "group-
> assignment"? In a previous message [1] I latched onto the statement
> "the implementation is assigning RMIDs to groups, assignment results in all
> events being counted.". In this I understood "groups" to be resctrl groups
> and I understood this to mean that when a (soft-ABMC) counter is assigned
> it applies to the entire resctrl group (all domains, all events). The
> subsequent example in [2] was thus unexpected to me when the interface
> was used to assign a (soft-ABMC) counter to the group but not all domains
> were impacted.
>
> Considering this, could you please elaborate what is meant with
> "group assignment"?

By "group assignment" I just mean that counters cannot be assigned to
individual MBM events: assignment results in counters being assigned to
all MBM events for a group in a domain.

I previously omitted per-domain assignment from soft-ABMC because Google
doesn't have a use case for it, and I started the prototype before Babu's
proposed interface required domain-scoped assignments [1]. Now that some
sort of domain selector is required, I'm reconsidering.
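For what it's worth, here is a rough sketch (Python, untested against a
real resctrl mount; the mbm_control line format and the "t"/"l"/"_" flags
are taken from the examples earlier in this thread, everything else is
hypothetical) of the read-back a generic user space would need after
writing to mbm_control in order to detect silently-upgraded flags:

```python
# Sketch: parse resctrl "mbm_control" output of the form
#   <group>//<domain>=<flags>;<domain>=<flags>;
# e.g. "test//0=_;1=tl;" (format as shown in the examples above).
# Flags: 't' (total), 'l' (local); '_' means no counter assigned.

def parse_mbm_control(text):
    """Return {group_name: {domain: set_of_flags}}."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        group, _, rest = line.partition("//")
        domains = {}
        for field in rest.split(";"):
            if not field:
                continue
            domain, _, flags = field.partition("=")
            domains[domain] = set() if flags == "_" else set(flags)
        groups[group] = domains
    return groups

def flags_after_write(before_text, after_text, group, domain):
    """Diff the state before/after a write so user space can see
    which flags a request actually resulted in (e.g. whether a
    "+l" request was silently upgraded to both 't' and 'l')."""
    before = parse_mbm_control(before_text)[group][domain]
    after = parse_mbm_control(after_text)[group][domain]
    return after - before
```

With the example above, diffing the file contents around the
`echo "test//1+l"` write would report that both "t" and "l" became set
in domain 1, which is how a generic tool would learn about the upgrade
without any prior knowledge of the implementation's semantics.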
-Peter

[1] https://lore.kernel.org/lkml/cover.1705688538.git.babu.moger@xxxxxxx/