Hi Reinette,

On 11/15/24 18:00, Reinette Chatre wrote:
> Hi Babu,
>
> On 10/29/24 4:21 PM, Babu Moger wrote:
>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>> supported.
>>
>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>
>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>
>> The "default" mode is the existing monitoring mode that works without the
>> explicit counter assignment, instead relying on dynamic counter assignment
>> by hardware that may result in hardware not dedicating a counter resulting
>> in monitoring data reads returning "Unavailable".
>>
>> Provide an interface to display the monitor mode on the system.
>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> [mbm_cntr_assign]
>> default
>>
>> Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
>> ---
>> v9: Updated user documentation based on comments.
>>
>> v8: Commit message update.
>>
>> v7: Updated the descriptions/commit log in resctrl.rst to generic text.
>>     Thanks to James and Reinette.
>>     Rename mbm_mode to mbm_assign_mode.
>>     Introduced mutex lock in rdtgroup_mbm_mode_show().
>>
>> v6: Added documentation for mbm_cntr_assign and legacy mode.
>>     Moved mbm_mode fflags initialization to static initialization.
>>
>> v5: Changed interface name to mbm_mode.
>>     It will be always available even if ABMC feature is not supported.
>>     Added description in resctrl.rst about ABMC mode.
>>     Fixed display abmc and legacy consistantly.
>>
>> v4: Fixed the checks for legacy and abmc mode. Default it ABMC.
>>
>> v3: New patch to display ABMC capability.
>> ---
>>  Documentation/arch/x86/resctrl.rst     | 33 ++++++++++++++++++++++++++
>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++
>>  2 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 30586728a4cd..a93d7980e25f 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -257,6 +257,39 @@ with the following files:
>>          # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>>          0=0x30;1=0x30;3=0x15;4=0x15
>>
>> +"mbm_assign_mode":
>> +    Reports the list of monitoring modes supported. The enclosed brackets
>> +    indicate which mode is enabled.
>> +    ::
>> +
>> +      # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>> +      [mbm_cntr_assign]
>> +      default
>> +
>> +    "mbm_cntr_assign":
>> +
>> +    In mbm_cntr_assign mode user-space is able to specify which of the
>> +    events in CTRL_MON or MON groups should have a counter assigned using the
>> +    "mbm_assign_control" file. The number of counters available is described
>> +    in the "num_mbm_cntrs" file. Changing the mode may cause all counters on
>> +    a resource to reset.
>> +
>> +    The mode is useful on platforms which support more CTRL_MON and MON
>> +    groups than the hardware counters, meaning 'unassigned' events on CTRL_MON or
>
> " than the hardware counters" -> " than hardware counters"?

Sure.

>
>> +    MON groups will report 'Unavailable' or count the traffic in an unpredictable
>> +    way.
>
> I think the above can be confusing to users. It mentioned "*will* report Unavailable"
> and then "*or* count the traffic in an unpredictable way". It is not possible for
> counter to report "Unavailable" while also reporting unpredictable data.
>
> My concern is that there is no way for a user to know if the platform supports more
> CTRL_MON and MON groups than hardware counters and the above seems to imply that counters
> may be unreliable ... so how does a user know if counters are unreliable or not?

That is correct. There is no definite way to find out if the counters are unreliable.

>
> Can this be made specific to help users know if their platforms are impacted? From
> what I know all AMD platforms are impacted so perhaps a straight-forward:
>
> "The mode is useful on AMD platforms which support more CTRL_MON and MON ..."

Sure.

>
> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
> to make the event data "more predictable" and then be concerned when the mode does
> not exist.
>
> As an alternative, is it possible to know the number of hardware counters on AMD systems
> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
> num_mbm_cntrs to num_rmids).

There is a roundabout (or hacky) way to find out the number of RMIDs that can be active.

>
>> +
>> +    AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>> +    enable this mode by default so that counters remain assigned even when the
>> +    corresponding RMID is not in use by any processor.
>> +
>> +    "default":
>> +
>> +    In default mode resctrl assumes there is a hardware counter for each
>> +    event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>> +    mbm_local_bytes may report 'Unavailable' if there is no counter associated
>> +    with that event.
>
> If I understand correctly, on AMD platforms without ABMC the events only report
> "Unavailable" if there is no counter assigned at the time of the query. If a counter
> is unassigned and then reassigned then the event count will reset and the user
> will get some data back but it may thus be unpredictable (to match earlier language).
> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
> "unpredictable" event counts (not just "Unavailable") ... this gets complicated

Yes. All the AMD systems without ABMC are affected by this problem.

> because users should be steered to avoid "default" mode if mbm_assign_mode is
> available, while not be made concerned to use "default" mode on Intel where
> mbm_assign_mode is not available. Can we add text to clarify this?
>
> Reinette
>
>

-- 
Thanks
Babu Moger
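
For illustration only (this is not part of the patch): since the documentation above says the bracketed entry in "mbm_assign_mode" is the enabled mode, user space could pick it out with a small shell helper along these lines. The function name mbm_enabled_mode is hypothetical, and the sketch assumes the file lists one mode per line, as in the example output quoted above.

```shell
# Hypothetical helper, not part of the patch: read the contents of
# mbm_assign_mode on stdin and print the enabled mode, i.e. the one
# entry enclosed in brackets. Lines without brackets print nothing.
mbm_enabled_mode() {
    sed -n 's/^[[:space:]]*\[\(.*\)][[:space:]]*$/\1/p'
}

# Example use on a system where the file exists:
#   mbm_enabled_mode < /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
```

On a system without the file (for example where the interface is not present), the redirect fails and nothing is printed, so a caller would need to treat empty output as "mode not reported".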