Hi Babu,

On 11/22/24 10:25 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 11/18/2024 4:07 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 11/18/24 11:04 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 11/15/24 18:00, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 10/29/24 4:21 PM, Babu Moger wrote:
>>>>> Introduce the interface file "mbm_assign_mode" to list monitor modes
>>>>> supported.
>>>>>
>>>>> The "mbm_cntr_assign" mode provides the option to assign a counter to
>>>>> an RMID, event pair and monitor the bandwidth as long as it is assigned.
>>>>>
>>>>> On AMD systems "mbm_cntr_assign" is backed by the ABMC (Assignable
>>>>> Bandwidth Monitoring Counters) hardware feature and is enabled by default.
>>>>>
>>>>> The "default" mode is the existing monitoring mode that works without the
>>>>> explicit counter assignment, instead relying on dynamic counter assignment
>>>>> by hardware that may result in hardware not dedicating a counter resulting
>>>>> in monitoring data reads returning "Unavailable".
>>>>>
>>>>> Provide an interface to display the monitor mode on the system.
>>>>> $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>>> [mbm_cntr_assign]
>>>>> default
>>>>>
>>>>> Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
>>>>> ---
>>
>> ...
>>
>>>> I'm concerned that users with Intel platforms may want to use the "mbm_cntr_assign" mode
>>>> to make the event data "more predictable" and then be concerned when the mode does
>>>> not exist.
>>>>
>>>> As an alternative, is it possible to know the number of hardware counters on AMD systems
>>>> without ABMC? I wonder if we could perhaps always expose num_mbm_cntrs as a way for
>>>> users to know if their platform may be impacted by this type of "unpredictability" (by comparing
>>>> num_mbm_cntrs to num_rmids).
>>>
>>> There is some round about(or hacky) way to find that out number of RMIDs
>>> that can be active.
>>
>> Does this give consistent and accurate data? Is this something that can be added to resctrl?
>> (Reading your other message [1] it does not sound as though it can produce an accurate
>> number on boot.)
>> If not then it will be up to the documentation to be accurate.
>>
>>
>>>>> +
>>>>> + AMD Platforms with ABMC (Assignable Bandwidth Monitoring Counters) feature
>>>>> + enable this mode by default so that counters remain assigned even when the
>>>>> + corresponding RMID is not in use by any processor.
>>>>> +
>>>>> + "default":
>>>>> +
>>>>> + In default mode resctrl assumes there is a hardware counter for each
>>>>> + event within every CTRL_MON and MON group. Reading mbm_total_bytes or
>>>>> + mbm_local_bytes may report 'Unavailable' if there is no counter associated
>>>>> + with that event.
>>>>
>>>> If I understand correctly, on AMD platforms without ABMC the events only report
>>>> "Unavailable" if there is no counter assigned at the time of the query. If a counter
>>>> is unassigned and then reassigned then the event count will reset and the user
>>>> will get some data back but it may thus be unpredictable (to match earlier language).
>>>> Is this correct? Any AMD platform in "default" mode may thus be vulnerable to
>>>> "unpredictable" event counts (not just "Unavailable") ... this gets complicated
>>>
>>> Yes. All the AMD systems without ABMC are affected by this problem.
>>>
>>>> because users should be steered to avoid "default" mode if mbm_assign_mode is
>>>> available, while not be made concerned to use "default" mode on Intel where
>>>> mbm_assign_mode is not available.
>>>
>>> Can we add text to clarify this?
>>
>> Please do.
>
> I think we need to add text about AMD systems. How about this?
>
> "default":
> In default mode resctrl assumes there is a hardware counter for each
> event within every CTRL_MON and MON group. On AMD systems with 16 more monitoring groups, reading mbm_total_bytes or mbm_local_bytes may report 'Unavailable' if there is no counter associated with that event.
> It is therefore recommended to use the 'mbm_cntr_assign' mode, if supported."

What is meant by "On AMD systems with 16 more monitoring groups"?

First, the language is not clear. Second, you mentioned earlier that there is
only a "hacky" way to determine the number of RMIDs that can be active, yet here
"16" is made official in the documentation?

Reinette
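
For reference, a minimal sketch of how user space might interpret the
mbm_assign_mode output quoted in the patch description. It assumes the
bracketed entry marks the currently active mode, as in the example above. The
function name parse_assign_mode is hypothetical and is not part of resctrl or
of this series; it works on the file contents as a string, so it does not
depend on the file being present.

```python
def parse_assign_mode(text):
    """Return (active_mode, all_modes) for mbm_assign_mode-style content.

    Assumption: one mode per line, with the active mode wrapped in
    square brackets, as in the example quoted from the patch description.
    """
    active = None
    modes = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # drop the brackets around the active mode
            active = name
        modes.append(name)
    return active, modes


# Sample content copied from the patch description.
sample = "[mbm_cntr_assign]\ndefault\n"
active, modes = parse_assign_mode(sample)
print("active:", active)
print("supported:", modes)
```

On a platform where the file does not exist, a caller would have to treat that
as the mode not being selectable; that case is left out of the sketch.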