Hi Reinette,

On Wed, Aug 14, 2024 at 10:37 AM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> Hi Peter,
>
> On 8/2/24 3:50 PM, Peter Newman wrote:
> > On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >> On 8/2/24 11:49 AM, Peter Newman wrote:
> >>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> >>>> I am of course not familiar with details of the software implementation
> >>>> - could there be benefits to using it even if hardware counters are
> >>>> supported?
> >>>
> >>> I can't see any situation where the user would want to choose software
> >>> over hardware counters. The number of groups which can be monitored by
> >>> software assignable counters will always be less than with hardware,
> >>> due to the need for consuming one RMID (and the counters automatically
> >>> allocated to it by the AMD hardware) for all unassigned groups.
> >>
> >> Thank you for clarifying. This seems specific to this software implementation,
> >> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
> >> correctly this depends on undocumented hardware specific knowledge.
> >
> > For the benefit of anyone else who needs to monitor bandwidth on a
> > large number of monitoring groups on pre-ABMC AMD implementations,
> > hopefully a future AMD publication will clarify, at least on some
> > existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
> >
> >
> >>> The behavior as I've implemented today is:
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> >>> 0
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=tl;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>>
> >>
> >> This highlights how there cannot be a generic/consistent interface between hardware
> >> and software implementation. If resctrl implements something like above without any
> >> other hints to user space then it will push complexity to user space since user space
> >> would not know if setting one flag results in setting more than that flag, which may
> >> force a user space implementation to always follow a write with a read that
> >> needs to confirm what actually resulted from the write. Similarly, that removing a
> >> flag impacts other flags needs to be clear without user space needing to "try and
> >> see what happens".
> >
> > I'll return to this topic in the context of MPAM below...
> >
> >> It is not clear to me how to interpret the above example when it comes to the
> >> RMID management though. If the RMID assignment is per group then I expected all
> >> the domains of a group to have the same flag(s)?
> >
> > The group RMIDs are never programmed into any MSRs and the RMID space
> > is independent in each domain, so it is still possible to do
> > per-domain assignment.
> > (and like with soft RMIDs, this enables us to
> > create unlimited groups, but we've never been limited by the size of
> > the RMID space)
> >
> > However, in our use cases, jobs are not confined to any domain, so
> > bandwidth measurements must be done simultaneously in all domains, so
> > we have no current use for per-domain assignment. But if any Google
> > users did begin to see value in confining jobs to domains, this could
> > change.
> >
> >>
> >>>>
> >>>>> However, If we don't expect to see these semantics in any other
> >>>>> implementation, these semantics could be implicit in the definition of
> >>>>> a SW assignable counter.
> >>>>
> >>>> It is not clear to me how implementation differences between hardware
> >>>> and software assignment can be hidden from user space. It is possible
> >>>> to let user space enable individual events and then silently upgrade it
> >>>> to all events. I see two options here, either "mbm_control" needs to
> >>>> explicitly show this "silent upgrade" so that user space knows which
> >>>> events are actually enabled, or "mbm_control" only shows flags/events enabled
> >>>> from user space perspective. In the former scenario, this needs more
> >>>> user space support since a generic user space cannot be confident which
> >>>> flags are set after writing to "mbm_control". In the latter scenario,
> >>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> >>>> to rely on it to know which events can be enabled and if some are
> >>>> actually "silently enabled" when user space still thinks it needs to be
> >>>> enabled the number of available counters becomes vague.
> >>>>
> >>>> It is not clear to me how to present hardware and software assignable
> >>>> counters with a single consistent interface. Actually, what if the
> >>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
> >>>> it is backed (hw vs sw)?
> >>>> What if, instead of "mbm_cntr_assignable" and
> >>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> >>>> and "mbm_cntr_group_assignable" is used? Could that replace a
> >>>> potential "mbm_assign_events" while also supporting user space in
> >>>> interactions with "mbm_control"?
> >>>
> >>> If I understand this correctly, is this a preference that the info
> >>> node be named differently if its value will have different units,
> >>> rather than a second node to indicate what the value of num_mbm_cntrs
> >>> actually means? This sounds reasonable to me.
> >>
> >> Indeed. As you highlighted, user space may not need to know if
> >> counters are backed by hardware or software, but user space needs to
> >> know what to expect from (how to interact with) interface.
> >>
> >>> I think it's also important to note that in MPAM, the MBWU (memory
> >>> bandwidth usage) monitors don't have a concept of local versus total
> >>> bandwidth, so event assignment would likely not apply there either.
> >>> What the counted bandwidth actually represents is more implicit in the
> >>> monitor's position in the memory system in the particular
> >>> implementation. On a theoretical multi-socket system, resctrl would
> >>> require knowledge about the system's architecture to stitch together
> >>> the counts from different types of monitors to produce a local and
> >>> total value. I don't know if we'd program this SoC-specific knowledge
> >>> into the kernel to produce a unified MBM resource like we're
> >>> accustomed to now or if we'd present multiple MBM resources, each only
> >>> providing an mbm_total_bytes event. In this case, the counters would
> >>> have to be assigned separately in each MBM resource, especially if the
> >>> different MBM resources support a different number of counters.
> >>>
> >>
> >> "total" and "local" bandwidth is already in grey area after the
> >> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> >> user space could set values reported to not be constrained by the
> >> "total" and "local" terms. We keep sticking with it though, even in
> >> this implementation that uses the "t" and "l" flags, knowing that
> >> what is actually monitored when "l" is set is just what the user
> >> configured via mbm_local_bytes_config, which theoretically
> >> can be "total" bandwidth.
> >
> > If it makes sense to support a separate, group-assignment interface at
> > least for MPAM, this would be a better fit for soft-ABMC, even if it
> > does have to stay downstream.
>
> (apologies for the delay)
>
> Could we please take a step back and confirm/agree what is meant with "group-
> assignment"? In a previous message [1] I latched onto the statement
> "the implementation is assigning RMIDs to groups, assignment results in all
> events being counted.". In this I understood "groups" to be resctrl groups
> and I understood this to mean that when a (soft-ABMC) counter is assigned
> it applies to the entire resctrl group (all domains, all events). The
> subsequent example in [2] was thus unexpected to me when the interface
> was used to assign a (soft-ABMC) counter to the group but not all domains
> were impacted.
>
> Considering this, could you please elaborate what is meant with
> "group assignment"?

By "group assignment" I just mean that counters cannot be assigned to
individual MBM events: assignment results in counters being assigned to
all MBM events for a group in a domain.

I previously omitted per-domain assignment from soft-ABMC because Google
doesn't have a use case for it, and I started the prototype before Babu's
proposed interface required domain-scoped assignments [1]. Now that some
sort of domain selector is required, I'm reconsidering.
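For what it's worth, here is a rough sketch (Python, untested against a
real resctrl mount; the mbm_control line format and the "t"/"l"/"_" flags
are taken from the examples earlier in this thread, everything else is
hypothetical) of the read-back a generic user space would need after
writing to mbm_control in order to detect silently-upgraded flags:

```python
# Sketch: parse resctrl "mbm_control" output of the form
#   <group>//<domain>=<flags>;<domain>=<flags>;
# e.g. "test//0=_;1=tl;" (format as shown in the examples above).
# Flags: 't' (total), 'l' (local); '_' means no counter assigned.

def parse_mbm_control(text):
    """Return {group_name: {domain: set_of_flags}}."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        group, _, rest = line.partition("//")
        domains = {}
        for field in rest.split(";"):
            if not field:
                continue
            domain, _, flags = field.partition("=")
            domains[domain] = set() if flags == "_" else set(flags)
        groups[group] = domains
    return groups

def flags_after_write(before_text, after_text, group, domain):
    """Diff the state before/after a write so user space can see
    which flags a request actually resulted in (e.g. whether a
    "+l" request was silently upgraded to both 't' and 'l')."""
    before = parse_mbm_control(before_text)[group][domain]
    after = parse_mbm_control(after_text)[group][domain]
    return after - before
```

With the example above, diffing the file contents around the
`echo "test//1+l"` write would report that both "t" and "l" became set
in domain 1, which is how a generic tool would learn about the upgrade
without any prior knowledge of the implementation's semantics.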
-Peter

[1] https://lore.kernel.org/lkml/cover.1705688538.git.babu.moger@xxxxxxx/