Hi Reinette,

On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> Hi Peter,
>
> On 2/21/25 5:12 AM, Peter Newman wrote:
> > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>> <reinette.chatre@xxxxxxxxx> wrote:
> >>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>> <reinette.chatre@xxxxxxxxx> wrote:
> >>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>> <reinette.chatre@xxxxxxxxx> wrote:
> >>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>
> >>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>
> >>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>> which seems to be needed for your proposal also?
> >>>>>>>>>>>> If we use possible flags of:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>> <value>
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>
> >>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>
> >>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>> is low enough to be of concern.
> >>>>>>>
> >>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>> for.
> >>>>>>
> >>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>
> >>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>> has space for 32 events for each "CPU agent" resource.
> >>>>>> That does not take into
> >>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>> customers.
> >>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>
> >>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>> event-set for applying to a single counter rather than as individual
> >>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>> event names.
> >>>>
> >>>> Thank you for clarifying.
> >>>>
> >>>>>
> >>>>> In the letters as events model, choosing the events assigned to a
> >>>>> group wouldn't be enough information, since we would want to control
> >>>>> which events should share a counter and which should be counted by
> >>>>> separate counters. I think the amount of information that would need
> >>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>
> >>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>> writes in ABMC would look like...
> >>>>>
> >>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>>>>
> >>>>
> >>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>> example and above the counter configuration appears to be global.
> >>>> You do mention
> >>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>> configuration is a requirement?
> >>>
> >>> If it's global and we want a particular group to be watched by more
> >>> counters, I wouldn't want this to result in allocating more counters
> >>> for that group in all domains, or allocating counters in domains where
> >>> they're not needed. I want to encourage my users to avoid allocating
> >>> monitoring resources in domains where a job is not allowed to run so
> >>> there's less pressure on the counters.
> >>>
> >>> In Dave's proposal it looks like global configuration means
> >>> globally-defined "named counter configurations", which works because
> >>> it's really per-domain assignment of the configurations to however
> >>> many counters the group needs in each domain.
> >>
> >> I think I am becoming lost. Would a global configuration not break your
> >> view of "event-set applied to a single counter"? If a counter is configured
> >> globally then it would not make it possible to support the full configurability
> >> of the hardware.
> >> Before I add more confusion, let me try with an example that builds on your
> >> earlier example copied below:
> >>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>>  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>>  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>  counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>
> >> Since the above states "per domain" I rewrite the example to highlight that as
> >> I understand it:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> You mention that you do not want counters to be allocated in domains that they
> >> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >> in domain 1, resulting in:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> With counter 0 and counter 1 available in domain 1, these counters could
> >> theoretically be configured to give group 1 more data in domain 1:
> >>
> >> group 0:
> >>  domain 0:
> >>   counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >>  domain 0:
> >>   counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW,LclNTWr,RmtNTWr
> >>  domain 1:
> >>   counter 0: LclFill,RmtFill
> >>   counter 1: LclNTWr,RmtNTWr
> >>   counter 2: LclSlowFill,RmtSlowFill
> >>   counter 3: VictimBW
> >>
> >> The counters are shown with different per-domain configurations that seems to
> >> match with earlier goals of (a) choose events counted by each counter and
> >> (b) do not allocate counters in domains where they are not needed.
> >> As I
> >> understand the above does contradict global counter configuration though.
> >> Or do you mean that only the *name* of the counter is global and then
> >> that it is reconfigured as part of every assignment?
> >
> > Yes, I meant only the *name* is global. I assume based on a particular
> > system configuration, the user will settle on a handful of useful
> > groupings to count.
> >
> > Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >
> > # define global configurations (in ABMC terms), not necessarily in this
> > # syntax and probably not in the mbm_assign_control file.
> >
> > r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > w=VictimBW,LclNTWr,RmtNTWr
> >
> > # legacy "total" configuration, effectively r+w
> > t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >
> > /group0/0=t;1=t
> > /group1/0=t;1=t
> > /group2/0=_;1=t
> > /group3/0=rw;1=_
> >
> > - group2 is restricted to domain 0
> > - group3 is restricted to domain 1
> > - the rest are unrestricted
> > - In group3, we decided we need to separate read and write traffic
> >
> > This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >
>
> I see. Thank you for the example.
>
> resctrl supports per-domain configurations with the following possible when
> using mbm_total_bytes_config and mbm_local_bytes_config:
>
> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>
> /group0/0=t;1=t
> /group1/0=t;1=t
>
> Even though the flags are identical in all domains, the assigned counters will
> be configured differently in each domain.
>
> With this supported by hardware and currently also supported by resctrl it seems
> reasonable to carry this forward to what will be supported next.
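
For reference, the counter totals in the quoted /group0../group3 example
can be checked with a short Python sketch. This is only an illustration,
not resctrl code: count_counters() is a hypothetical helper, and it
assumes that each flag letter (r, w, t) names one counter configuration
consuming one counter in that domain, and that '_' means no counter is
assigned there.

```python
def count_counters(assignments):
    """Count counters consumed per domain.

    Assumption (not taken from resctrl itself): every flag letter in a
    domain's assignment consumes one counter, and '_' consumes none.
    """
    usage = {}
    for line in assignments:
        # Lines look like "/group3/0=rw;1=_": group path, then domain list.
        _path, _sep, domains = line.rpartition("/")
        for entry in domains.split(";"):
            domain, flags = entry.split("=")
            used = 0 if flags == "_" else len(flags)
            usage[int(domain)] = usage.get(int(domain), 0) + used
    return usage


example = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]
usage = count_counters(example)
print(usage)  # {0: 4, 1: 3}
```

Under those assumptions the result matches the "4 counters in domain 0
and 3 counters in domain 1" stated in the quoted example.
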

The hardware supports both a per-domain mode, where all groups in a
domain use the same configurations and are limited to two events per
group and a per-group mode where every group can be configured and
assigned freely.

This series is using the legacy counter access mode where only
counters whose BwType matches an instance of QOS_EVT_CFG_n in the
domain can be read. If we chose to read the assigned counter directly
(QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) rather than
asking the hardware to find the counter by RMID, we would not be
limited to 2 counters per group/domain and the hardware would have the
same flexibility as on MPAM.

(I might have said something confusing in my last messages because I
had forgotten that I switched to the extended assignment mode when
prototyping with soft-ABMC and MPAM.)

Forcing all groups on a domain to share the same 2 counter
configurations would not be acceptable for us, as the example I gave
earlier is one I've already been asked about.

I'm worried about requiring support for domain-level
mbm_total_bytes_config and mbm_local_bytes_config files to be carried
forward, because this conflicts with the configuration being per
group/domain. (i.e., what would be read back from the domain files if
per-group customizations have already been applied?)

>
> >>
> >>>> Until now I viewed counter configuration separate from counter assignment,
> >>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> >>>> mbm_local_bytes_config before they are assigned. That is still per-domain
> >>>> counter configuration though, not per-counter.
> >>>>
> >>>>> I assume packing all of this info for a group's desired counter
> >>>>> configuration into a single line (with 32 domains per line on many
> >>>>> dual-socket AMD configurations I see) would be difficult to look at,
> >>>>> even if we could settle on a single letter to represent each
> >>>>> universally.
> >>>>>
> >>>>>>
> >>>>>> My goal is for resctrl to have a user interface that can as much as possible
> >>>>>> be ready for whatever may be required from it years down the line. Of course,
> >>>>>> I may be wrong and resctrl would never need to support more than 26 events per
> >>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >>>>>> and how could resctrl support that?
> >>>>>>
> >>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
> >>>>>> the interface I used as demonstration may become unwieldy to parse on a system
> >>>>>> with many domains that supports many events. This is a concern for me. Any suggestions
> >>>>>> will be appreciated, especially from you since I know that you are very familiar with
> >>>>>> issues related to large scale use of resctrl interfaces.
> >>>>>
> >>>>> It's mainly just the unwieldiness of all the information in one file.
> >>>>> It's already at the limit of what I can visually look through.
> >>>>
> >>>> I agree.
> >>>>
> >>>>>
> >>>>> I believe that shared assignments will take care of all the
> >>>>> high-frequency and performance-intensive batch configuration updates I
> >>>>> was originally concerned about, so I no longer see much benefit in
> >>>>> finding ways to textually encode all this information in a single file
> >>>>> when it would be more manageable to distribute it around the
> >>>>> filesystem hierarchy.
> >>>>
> >>>> This is significant. The motivation for the single file was to support
> >>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> >>>> not also depend on the same files that, if distributed, will require many
> >>>> filesystem operations?
> >>>> Having the files distributed will be significantly simpler while also
> >>>> avoiding the file size issue that Dave Martin exposed.
> >>>
> >>> The remaining filesystem operations will be assigning or removing
> >>> shared counter assignments in the applicable domains, which would
> >>> normally correspond to mkdir/rmdir of groups or changing their CPU
> >>> affinity. The shared assignments are more "program and forget", while
> >>> the exclusive assignment approach requires updates for every counter
> >>> (in every domain) every few seconds to cover a large number of groups.
> >>>
> >>> When they want to pay extra attention to a particular group, I expect
> >>> they'll ask for exclusive counters and leave them assigned for a while
> >>> as they collect extra data.
> >>
> >> The single file approach is already unwieldy. The demands that will be
> >> placed on it to support the usages currently being discussed would make this
> >> interface even harder to use and manage. If the single file is not required
> >> then I think we should go back to smaller files distributed in resctrl.
> >> This may not even be an either/or argument. One way to view mbm_assign_control
> >> could be as a way for user to interact with the distributed counter
> >> related files with a single file system operation. Although, without
> >> knowing how counter configuration is expected to work this remains unclear.
> >
> > If we do both interfaces and the multi-file model gives us more
> > capability to express configurations, we could find situations where
> > there are configurations we cannot represent when reading back from
> > mbm_assign_control, or updates through mbm_assign_control have
> > ambiguous effects on existing configurations which were created with
> > other files.
>
> Right. My assumption was that the syntax would be identical.
> >
> > However, the example I gave above seems to be adequately represented
> > by a minor extension to mbm_assign_control and we all seem to
> To confirm what you mean with "minor extension to mbm_assign_control",
> is this where the flags are associated with counter configurations? At this
> time this is done separately from mbm_assign_control with the hardcoded "t"
> and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes
> respectively. I think it would be simpler to keep these configurations
> separate from mbm_assign_control. How it would look without better
> understanding of MPAM is not clear to me at this time, unless if the
> requirement is to enhance support for ABMC and BMEC. I do see that
> this can be added later to build on what is supported by mbm_assign_control
> with the syntax in this version.

As I explained above, I was looking at this from the perspective of
the extended event assignment mode.

Thanks,
-Peter