Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

"Moger, Babu" <babu.moger@xxxxxxx> · Wed, 28 Feb 2024 11:59:19 -0600

Hi Reinette,

On 2/27/24 17:50, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/23/24 16:21, Reinette Chatre wrote:
> 
>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>     system, where the previous "num_rmids" monitor groups can be created but
>>>>>     the counters are reset unpredictably ... should this still be supported
>>>>>     on ABMC systems though?
>>>>
>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>
>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>
>> The debug option gives wrong impression. It is better to keep the option
>> open to enable the feature in normal mode.
> 
> You mentioned that it would only be needed when there are hardware or
> software issues ... so debug does sound appropriate. Could you please give
> an example of how debug option gives wrong impression? Why would you want
> users to keep using "legacy" mode on an ABMC system?

I don't have a strong argument here. I am fine as long as there is a way
to go back to legacy mode if required. We can provide the legacy option
in debug mode.

> 
> ...
> 
>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>> User space could theoretically create more monitor groups than the number of
>>> rmids that a resource claims to support using current upstream enumeration.
>>
>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>> more than this limit(r->num_rmid).
>>
>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>> RMID to assign the monitoring. So, assignment limit is
>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
> 
> I see. Thank you for clarifying. This does make enabling simpler and one
> less user interface item that needs changing.
> 
> ...
> 
>>>> 2. /sys/fs/resctrl/monitor_state.
>>>> This can used to individually assign or unassign the counters in each group.
>>>>
>>>> When assigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>
>>>> When unassigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>
>>>>
>>>> Thoughts?
>>>
>>> How do you expect this interface to be used? I understand the mechanics
>>> of this interface but on a higher level, do you expect user space to
>>> once in a while assign a new counter to a single event or monitor group
>>> (for which a fine grained interface works) or do you expect user space to
>>> shift multiple counters across several monitor events at intervals?
>>
>> I think we should provide both the options. I was thinking of providing
>> fine grained interface first.
> 
> Could you please provide a motivation for why two interfaces, one inefficient
> and one not, should be created and maintained? Users can still do fine grained
> assignment with a global assignment interface.

Let's consider them one by one.

1. Fine grained assignment.

It will be part of the mongroup (or control mongroup). The user has access
to the group and can query the group's current status before assigning or
unassigning.

   $cd /sys/fs/resctrl/ctrl_mon1
   $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
       0=total-unassign,local-unassign;1=total-unassign,local-unassign;

Assign the total event:

   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign the local event:

   $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign both events:

   $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Check the assignment status:

   $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
       0=total-assign,local-assign;1=total-unassign,local-unassign;
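
For tooling that reads this file, the format above is straightforward to
parse. A minimal Python sketch, assuming the example syntax shown here
(domains separated by ';', events by ',', and an "-assign"/"-unassign"
suffix on each event). parse_monitor_state is a hypothetical helper name
used only for illustration; it is not part of resctrl.

```python
def parse_monitor_state(text):
    """Parse '0=total-assign,local-assign;1=...;' into {domain: {event: bool}}."""
    state = {}
    for entry in text.strip().split(";"):
        if not entry:
            continue  # tolerate the trailing ';' seen in the example output
        domain, events = entry.split("=", 1)
        flags = {}
        for token in events.split(","):
            # 'total-unassign' -> ('total', 'unassign')
            event, action = token.rsplit("-", 1)
            if action not in ("assign", "unassign"):
                raise ValueError("unknown action: %r" % token)
            flags[event] = (action == "assign")
        state[int(domain)] = flags
    return state


sample = "0=total-assign,local-assign;1=total-unassign,local-unassign;"
state = parse_monitor_state(sample)
print(state[0]["total"], state[1]["local"])
```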

- The user interface is simple.

- Assignment will fail if all the h/w counters are exhausted. The user
then needs to unassign a counter from another group and use that counter
here. This can be done by just querying the monitor state of another group.

- A monitor group's details (cpus, tasks) are part of the group. So, it is
better to keep the assignment state inside the group as well.
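
The exhaustion behavior described above can be modeled roughly as follows.
This is only a sketch under stated assumptions: it treats the assignable
counters as one global pool (real hardware may account for them
differently, for example per domain), and CounterPool and its methods are
hypothetical names, not kernel code.

```python
class CounterPool:
    """Toy model of a fixed set of assignable h/w counters."""

    def __init__(self, num_counters):
        self.free = list(range(num_counters))
        self.assigned = {}  # (group, domain, event) -> counter id

    def assign(self, group, domain, event):
        key = (group, domain, event)
        if key in self.assigned:
            return self.assigned[key]  # already assigned, nothing to do
        if not self.free:
            # Mirrors "assignment will fail if all counters are exhausted".
            raise RuntimeError("no free counters")
        counter = self.free.pop(0)
        self.assigned[key] = counter
        return counter

    def unassign(self, group, domain, event):
        counter = self.assigned.pop((group, domain, event), None)
        if counter is not None:
            self.free.append(counter)  # counter becomes reusable


pool = CounterPool(num_counters=2)  # pool size is an arbitrary example value
pool.assign("ctrl_mon1", 0, "total")
pool.assign("ctrl_mon1", 0, "local")
try:
    pool.assign("ctrl_mon2", 0, "total")
except RuntimeError as err:
    print("assign failed:", err)
# Free a counter in the other group, then retry.
pool.unassign("ctrl_mon1", 0, "local")
pool.assign("ctrl_mon2", 0, "total")
```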

Note: The interface names used here are only examples.

2. Global assignment.

I would assume the interface file will be in the
/sys/fs/resctrl/info/L3_MON/ directory.

In case there are 100 mongroups, we need to have a way to list the current
assignment status for these groups. I am not sure how to list the status
of these 100 groups.

If the user wants to assign the local event (or total) in a specific group
in this list of 100 groups, I am not sure how to provide an interface for
that. Should we pass the name of the mongroup? That will involve looping
through the groups using kernfs_walk_and_get(). This may be ok if we are
dealing with a very small number of groups.
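
To make the open questions above concrete, here is one hypothetical shape
a global file could take: one line per group, with the group name followed
by the same per-domain state used in the fine-grained examples. The
"name:" prefix, the function names, and the dict-based lookup are all
assumptions made for illustration; nothing here is a proposed or existing
resctrl format.

```python
EVENTS = ("total", "local")


def format_group(name, domains):
    """Render one group as 'name:0=total-assign,local-unassign;1=...'."""
    parts = []
    for dom in sorted(domains):
        flags = ",".join(
            "%s-%s" % (ev, "assign" if domains[dom][ev] else "unassign")
            for ev in EVENTS
        )
        parts.append("%d=%s" % (dom, flags))
    return "%s:%s" % (name, ";".join(parts))


def format_all(groups):
    """List every group, one per line (100 groups -> 100 lines)."""
    return "\n".join(format_group(n, groups[n]) for n in sorted(groups))


def apply_write(groups, line):
    """Apply a write such as 'ctrl_mon1:0=local-assign' to one group."""
    name, request = line.split(":", 1)
    if name not in groups:  # stands in for a lookup of the group by name
        raise KeyError("no such group: %s" % name)
    domain, token = request.split("=", 1)
    event, action = token.rsplit("-", 1)
    if event not in EVENTS or action not in ("assign", "unassign"):
        raise ValueError("bad request: %r" % request)
    groups[name][int(domain)][event] = (action == "assign")


groups = {
    "ctrl_mon1": {0: {"total": False, "local": False}},
    "ctrl_mon2": {0: {"total": True, "local": False}},
}
apply_write(groups, "ctrl_mon1:0=local-assign")
print(format_all(groups))
```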

-- 
Thanks
Babu Moger