Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Peter Newman <peternewman@xxxxxxxxxx> · Fri, 21 Feb 2025 14:12:51 +0100

Hi Reinette,

On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> Hi Peter,
>
> On 2/20/25 6:53 AM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>> Hi Reinette,
> >>>
> >>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>> <reinette.chatre@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>> <reinette.chatre@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>
> >>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>
> >>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>
> >>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>
> >>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>
> >>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>> <value>
> >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>
> >>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>
> >>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>> is low enough to be of concern.
> >>>>>
> >>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>> investigation, I would question whether they know what they're looking
> >>>>> for.
> >>>>
> >>>> The key here is "so far" as well as the focus on MBM only.
> >>>>
> >>>> It is impossible for me to predict what we will see in a couple of years
> >>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>> that he is working on patches [1] that will add new events and shared the idea
> >>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>> customers.
> >>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>
> >>> I was thinking of the letters as representing a reusable, user-defined
> >>> event-set for applying to a single counter rather than as individual
> >>> events, since MPAM and ABMC allow us to choose the set of events each
> >>> one counts. Wherever we define the letters, we could use more symbolic
> >>> event names.
> >>
> >> Thank you for clarifying.
> >>
> >>>
> >>> In the letters as events model, choosing the events assigned to a
> >>> group wouldn't be enough information, since we would want to control
> >>> which events should share a counter and which should be counted by
> >>> separate counters. I think the amount of information that would need
> >>> to be encoded into mbm_assign_control to represent the level of
> >>> configurability supported by hardware would quickly get out of hand.
> >>>
> >>> Maybe as an example, one counter for all reads, one counter for all
> >>> writes in ABMC would look like...
> >>>
> >>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>
> >>> (per domain)
> >>> group 0:
> >>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>> group 1:
> >>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>> ...
> >>>
> >>
> >> I think this may also be what Dave was heading towards in [2] but in that
> >> example and above the counter configuration appears to be global. You do mention
> >> "configurability supported by hardware" so I wonder if per-domain counter
> >> configuration is a requirement?
> >
> > If it's global and we want a particular group to be watched by more
> > counters, I wouldn't want this to result in allocating more counters
> > for that group in all domains, or allocating counters in domains where
> > they're not needed. I want to encourage my users to avoid allocating
> > monitoring resources in domains where a job is not allowed to run so
> > there's less pressure on the counters.
> >
> > In Dave's proposal it looks like global configuration means
> > globally-defined "named counter configurations", which works because
> > it's really per-domain assignment of the configurations to however
> > many counters the group needs in each domain.
>
> I think I am becoming lost. Would a global configuration not break your
> view of "event-set applied to a single counter"? If a counter is configured
> globally then it would not make it possible to support the full configurability
> of the hardware.
> Before I add more confusion, let me try with an example that builds on your
> earlier example copied below:
>
> >>> (per domain)
> >>> group 0:
> >>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>> group 1:
> >>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>> ...
>
> Since the above states "per domain" I rewrite the example to highlight that as
> I understand it:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>
> You mention that you do not want counters to be allocated in domains that they
> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> in domain 1, resulting in:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>
> With counter 0 and counter 1 available in domain 1, these counters could
> theoretically be configured to give group 1 more data in domain 1:
>
> group 0:
>  domain 0:
>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 1: VictimBW,LclNTWr,RmtNTWr
> group 1:
>  domain 0:
>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>   counter 3: VictimBW,LclNTWr,RmtNTWr
>  domain 1:
>   counter 0: LclFill,RmtFill
>   counter 1: LclNTWr,RmtNTWr
>   counter 2: LclSlowFill,RmtSlowFill
>   counter 3: VictimBW
>
> The counters are shown with different per-domain configurations that seems to
> match with earlier goals of (a) choose events counted by each counter and
> (b) do not allocate counters in domains where they are not needed. As I
> understand the above does contradict global counter configuration though.
> Or do you mean that only the *name* of the counter is global and then
> that it is reconfigured as part of every assignment?

Yes, I meant only the *name* is global. I assume that, for a particular
system configuration, the user will settle on a handful of useful
event groupings to count.

Perhaps mbm_assign_control syntax is the clearest way to express an example...

 # define global configurations (in ABMC terms), not necessarily in this
 # syntax and probably not in the mbm_assign_control file.

 r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
 w=VictimBW,LclNTWr,RmtNTWr

 # legacy "total" configuration, effectively r+w
 t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr

 /group0/0=t;1=t
 /group1/0=t;1=t
 /group2/0=_;1=t
 /group3/0=rw;1=_

- group2 is restricted to domain 1 (0=_ means no counters in domain 0)
- group3 is restricted to domain 0 (1=_ means no counters in domain 1)
- the rest are unrestricted
- In group3, we decided we need to separate read and write traffic

This consumes 4 counters in domain 0 and 3 counters in domain 1.
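
To double-check the counter arithmetic, here is a minimal Python sketch
that parses the assignment lines above and tallies counters per domain.
It assumes the hypothetical syntax from this example only: each letter
names one event set and costs one counter, "_" means no assignment, and
domains are separated by ";". The names EVENT_SETS, parse_line and
counters_per_domain are made up for illustration and are not part of
resctrl.

```python
from collections import Counter

# Hypothetical named event sets, copied from the example above.
EVENT_SETS = {
    "r": {"LclFill", "RmtFill", "LclSlowFill", "RmtSlowFill"},
    "w": {"VictimBW", "LclNTWr", "RmtNTWr"},
}
# Legacy "total" configuration, effectively r+w.
EVENT_SETS["t"] = EVENT_SETS["r"] | EVENT_SETS["w"]

ASSIGNMENTS = """\
/group0/0=t;1=t
/group1/0=t;1=t
/group2/0=_;1=t
/group3/0=rw;1=_
"""


def parse_line(line):
    """Split '/<group>/<dom>=<flags>;...' into (group, {domain: [flags]})."""
    _, group, domains = line.strip().split("/", 2)
    per_domain = {}
    for entry in domains.split(";"):
        dom, flags = entry.split("=")
        # "_" means no counter is assigned in this domain.
        per_domain[int(dom)] = [f for f in flags if f != "_"]
    return group, per_domain


def counters_per_domain(text):
    """Count counters used per domain: one counter per assigned letter."""
    usage = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue
        _, per_domain = parse_line(line)
        for dom, flags in per_domain.items():
            for flag in flags:
                if flag not in EVENT_SETS:
                    raise ValueError(f"unknown event set {flag!r}")
            usage[dom] += len(flags)
    return usage


usage = counters_per_domain(ASSIGNMENTS)
print(dict(usage))  # {0: 4, 1: 3}
```

The point of the sketch is only that counter pressure follows directly
from the per-domain flags, so a group that is "_" in a domain costs
nothing there.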

>
> >> Until now I viewed counter configuration separate from counter assignment,
> >> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> >> mbm_local_bytes_config before they are assigned. That is still per-domain
> >> counter configuration though, not per-counter.
> >>
> >>> I assume packing all of this info for a group's desired counter
> >>> configuration into a single line (with 32 domains per line on many
> >>> dual-socket AMD configurations I see) would be difficult to look at,
> >>> even if we could settle on a single letter to represent each
> >>> universally.
> >>>
> >>>>
> >>>> My goal is for resctrl to have a user interface that can as much as possible
> >>>> be ready for whatever may be required from it years down the line. Of course,
> >>>> I may be wrong and resctrl would never need to support more than 26 events per
> >>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >>>> and how could resctrl support that?
> >>>>
> >>>> What is the risk of supporting more than 26 events? As I highlighted earlier
> >>>> the interface I used as demonstration may become unwieldy to parse on a system
> >>>> with many domains that supports many events. This is a concern for me. Any suggestions
> >>>> will be appreciated, especially from you since I know that you are very familiar with
> >>>> issues related to large scale use of resctrl interfaces.
> >>>
> >>> It's mainly just the unwieldiness of all the information in one file.
> >>> It's already at the limit of what I can visually look through.
> >>
> >> I agree.
> >>
> >>>
> >>> I believe that shared assignments will take care of all the
> >>> high-frequency and performance-intensive batch configuration updates I
> >>> was originally concerned about, so I no longer see much benefit in
> >>> finding ways to textually encode all this information in a single file
> >>> when it would be more manageable to distribute it around the
> >>> filesystem hierarchy.
> >>
> >> This is significant. The motivation for the single file was to support
> >> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> >> not also depend on the same files that, if distributed, will require many
> >> filesystem operations?
> >> Having the files distributed will be significantly simpler while also
> >> avoiding the file size issue that Dave Martin exposed.
> >
> > The remaining filesystem operations will be assigning or removing
> > shared counter assignments in the applicable domains, which would
> > normally correspond to mkdir/rmdir of groups or changing their CPU
> > affinity. The shared assignments are more "program and forget", while
> > the exclusive assignment approach requires updates for every counter
> > (in every domain) every few seconds to cover a large number of groups.
> >
> > When they want to pay extra attention to a particular group, I expect
> > they'll ask for exclusive counters and leave them assigned for a while
> > as they collect extra data.
>
> The single file approach is already unwieldy. The demands that will be
> placed on it to support the usages currently being discussed would make this
> interface even harder to use and manage. If the single file is not required
> then I think we should go back to smaller files distributed in resctrl.
> This may not even be an either/or argument. One way to view mbm_assign_control
> could be as a way for user to interact with the distributed counter
> related files with a single file system operation. Although, without
> knowing how counter configuration is expected to work this remains unclear.

If we provide both interfaces and the multi-file model can express
more configurations, we could hit cases where a configuration cannot
be represented when read back from mbm_assign_control, or where an
update through mbm_assign_control has an ambiguous effect on existing
configurations that were created through the other files.

However, the example I gave above seems to be adequately represented
by a minor extension to mbm_assign_control and we all seem to
understand it now, so maybe it's not broken yet. It's unfortunate that
work went into a requirement that's no longer relevant, but I don't
think that on its own is a blocker.

-Peter