Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dave,

On 2/13/25 9:37 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>> Hi there,
>>>
>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>
>>>> Series is written such that it is easier to support other assignable
>>>> features supported from different vendors.
>>>>
>>>> The feature details are documented in the  APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC). The documentation is available at
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>
>>>> The patches are based on top of commit
>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
> 
> [...]
> 
>>>> b. Check how many ABMC counters are available. 
>>>>
>>>> 	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs 
>>>> 	32
>>>
>>> Is this file needed?
>>>
>>> With MPAM, it is more difficult to promise that the same number of
>>> counters will be available everywhere.
>>>
>>> Rather than lie, or report a "safe" value here that may waste some
>>> counters, can we just allow the number of counters to be be discovered
>>> per domain via available_mbm_cntrs?
>>
>> This sounds reasonable to me. I think us having trouble with the
>> user documentation of this file so late in development should also have been
>> a sign to rethink its value.
>>
>> For a user to discover the number of counters supported via available_mbm_cntrs
>> would require the file's contents to be captured right after mount. Since we've
>> had scenarios where new userspace needs to discover an up-and-running system's
>> configuration this may not be possible. I thus wonder instead of removing
>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>> instead of a single value? 
> 
> Is it actually useful to be able to discover the number of counters
> that exist?  A counter that exists but is not available cannot be used,
> so perhaps it is not useful to know about it in the first place.

An alternative perspective of what "available" means is "how many counters
could I possibly get to do this new monitoring task". A user may be willing
to re-assign counters if the new monitoring task is important. Knowing
how many counters are already free and available for assignment would be
easy from available_mbm_cntrs but to get an idea of how many counters
could be re-assigned to help out with the new task would require
some intricate parsing of mbm_assign_control.


> But if we keep this file but make it report the number of counters for
> each domain (similarly to mbm_available_cntrs), then I think the MPAM
> driver should be able to work with that.
> 
>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>> be good to avoid any more parameters of this sort from being reported
>>> to userspace unless there is a clear understanding of why they are
>>> needed.
>>
>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
> 
> For clarity: this is a background issue, mostly orthogonal to this
> series.
> 
> If this series is merged as-is, with a global per-resource
> num_mbm_cntrs property, then this not really worse than the current
> situation -- it's just a bit annoying from the MPAM perspective.
> 
> 
> In a nutshell, the num_closids / num_rmids parameters seem to expose
> RDT-specific hardware semantics to userspace, implying a specific
> allocation model for control group and monitoring group identifiers.
> 
> The guarantees that userspace is entitled to asssume when resctrl
> reports particular values do not seem to be well described and are hard
> to map onto the nearest-equivalent MPAM implementation.  A combination
> of control and monitoring groups that can be created on x86 may not be
> creatable on MPAM, even when the number of supportable control and
> monitoring partitions is the same.

I understand. This interface was created almost a decade ago. It would have been
wonderful if the user interface could have been created with a clear vision
of all the use cases it would end up needing to support. I am trying to be
very careful with this new user interface as I try to consider all the things I
learned while working on resctrl. All help get this new interface right is
greatly appreciated.

Since your specifically mention issues that MPAM has with num_rmids, please
note that we have been trying (see [1], but maybe start reading thread at [2])
to find ways to make this work with MPAM but no word from MPAM side. 
I see that you were not cc'd on the discussion so this is not a criticism of
you personally but I would like to highlight that we do try to make things
work well for MPAM but so far this work seems ignored, yet critisized
for not being done. I expect the more use cases are thrown at an interface
as it is developed the better it would get and I would gladly work with MPAM
folks to improve things.

> Even with the ABMC series, we may still be constrained on what we can
> report for num_rmids: we can't know in advance whether or not the user
> is going to use mbm_cntr_assign mode -- if not, we can't promise to
> create more monitoring groups than the number of counters in the
> hardware.

It is the architecture that decides which modes are supported and
which is default.

> It seems natural for the counts reported by "available_mbm_cntrs" to
> change dynamically when the ABMC assignment mode is changed, but I
> think userspace are likely to expect the global "num_rmids" parameters
> to be fixed for the lifetime of the resctrl mount (and possibly fixed
> for all time on a given hardware platform -- at least, modulo CDP).
> 
> 
> I think it might be possible to tighten up the docmentation of
> num_closids in particular in a way that doesn't conflict with x86 and
> may make it easier for MPAM to fit in with, but that feels like a
> separate conversation.
> 
> None of this should be considered a blocker for this series, either way.
> 
>>>
>>> Reporting number of counters per monitoring domain is a more natural
>>> fit for MPAM, as below:
>>>
>>>> c. Check how many ABMC counters are available in each domain.
>>>>
>>>> 	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs 
>>>> 	0=30;1=30
>>>
>>> For MPAM, this seems supportable.  Each monitoring domain will have
>>> some counters, and a well-defined number of them will be available for
>>> allocation at any one time.
> 
> [...]
> 
>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> [...]
> 
>>>>        Flags can be one of the following:
>>>>
>>>>         t  MBM total event is enabled.
>>>>         l  MBM local event is enabled.
>>>>         tl Both total and local MBM events are enabled.
>>>>         _  None of the MBM events are enabled
>>>>
>>>> 	Examples:
>>>
>>> [...]
>>>
>>> I think that this basically works for MPAM.
>>>
>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>> but this problem is not specific to ABMC.  It feels sensible for ABMC
>>> to be built around the same concepts that resctrl already has elsewhere
>>> in the interface.  MPAM will do its best to fit (as already).
>>>
>>> Regarding Peter's use case of assiging multiple counters to a
>>> monitoring group [1], I feel that it's probably good enough to make
>>> sure that the ABMC interface can be extended in future in a backwards
>>> compatible way so as to support this, without trying to support it
>>> immediately.
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@xxxxxxxxxxxxxx/
>>>
>>
>> I do not think that resctrl's current support of the mbm_total_bytes and
>> mbm_local_bytes should be considered as the "only" two available "slots"
>> into which all possible events should be forced into. "mon_features" exists
>> to guide user space to which events are supported and as I see it new events
>> can be listed here to inform user space of their availability, with their
>> associated event files available in the resource groups.
> 
> That's fair.  I wasn't currently sure how (or if) the set of countable
> events was expected to grow / evolve via this route.
> 
> Either way, I think this confirms that there is at least one viable way
> to enable more counters for a single control group, on top of this
> series.
> 
> (If there is more than one way, that seems fine?)
> 
>>>
>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>> combined with new counter files in resctrlfs, that feels like a
>>> possible approach.  ABMC (as in this series) should just reject such
>>> such assignments, and the new counter files wouldn't exist.
>>>
>>> Availability of this feature could also be reported as a distinct mode
>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>
>>>
>>> A _sketch_ of this follows.  This is NOT a proposal -- the key
>>> question is whether we are confident that we can extend the interface
>>> in this way in the future without breaking anything.
>>>
>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>> a foundation to build on.
>>>
>>> --8<--
>>>
>>> [artists's impression]
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>  	mbm_cntr_generic
>>>  	[mbm_cntr_assign]
>>>  	default
>>>
>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type 
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type 
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type 
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type 
>>>
>>> ...
>>>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>
>>> etc.
>>>
>>
>> It is not clear to me what additional features such an interface enables. It
>> also looks like user space will need to track and manage counter IDs?
> 
> My idea was that for these generic counters, new files could be exposed
> to configure what they actually count (the ..._type files shown above;
> or possibly via the ..._config files that already exist).
> 
> The "IDs" were inteded as abstract; the number only relates the
> assignments in mbm_assign_control to the files created elsewhere.  This
> wouldn't be related to IDs assigned by the hardware.

I see. Yes, this sounds related to and a generalization of the AMD
configurable event feature.

> 
> If there are multiple resctrl users then using numeric IDs might be
> problematic; though if we go eventually in the direction of making
> resctrlfs multi-mountable then each mount could have its own namespace.

I am not aware of "multi-mountable" direction.

> 
> Allowing counters to be named and configured with a mkdir()-style
> interface might be possible too; that might make it easier for users to
> coexist within a single resctrl mount (if we think that's important
> enough).
> 
>> It sounds to me as though the issue starts with your statement
>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>> address this I expect that an MPAM system will not support nor list
>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>> it would list the events that are appropriate to the system? Trying to match
>> with what Peter said [1] in the message you refer to, this may be possible:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> mbm_local_read_bytes
>> mbm_local_write_bytes
>> mbm_local_bytes
>>
>> (*) I am including mbm_local_bytes since it could be an event that can be software
>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>> counted.
>>
>> I see the support for MPAM events distinct from the support of assignable counters.
>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>> Please help me understand if you see it differently.
>> 	
>> Doing so would need to come up with alphabetical letters for these events,
>> which seems to be needed for your proposal also? If we use possible flags of:
>>
>> mbm_local_read_bytes a
>> mbm_local_write_bytes b
>>
>> Then mbm_assign_control can be used as:
>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>> <value>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>
>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@xxxxxxxxxxxxxx/
> 
> That approach would also work, where an MPAM system has events are not
> a reasonable approximation of the generic "total" or "local".
> 
> For now we would probably stick with "total" and "local" anyway though,
> because the MPAM architecture doesn't natively allow the mapping onto
> the memory system topology to be discovered, and the information in
> ACPI / device tree is insufficient to tell us everything we'd need to
> know.  But I guess what counts as "local" in particular will be quite
> hardware and topology dependent even on x86, so perhaps we shouldn't
> worry about having the behaviour match exactly (?)
> 
> Regarding the code letters, my idea was that the event type might be
> configured by a separate file, instead of in mbm_assign_control
> directly, in which case running out of letters wouldn't be a problem.

This work started with individual files for counters but the issue was
raised that this will require a large number of filesystem calls when, for
example, a user wants to move a group of counters associated with the events
of one set of monitoring groups to another set of monitoring groups. This
is for the use case where there are a significant number of monitor groups
for which there are not sufficient counters. With mbm_assign_control this
can be done in a single write and such a monitoring transition can thus
be accomplished more efficiently.

> 
> Alternatively, if we want to be able to expand beyond single letters,
> could we reserve one or more characters for extension purposes?
> 
> If braces are forbidden by the syntax today, could we add support for
> something like the following later on, without breaking anything?
> 
> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 

Thank you for the suggestion. I think we may need something like this.
Babu, what do you think?

> 
> For now, my main concern would be whether this series prevents that
> sort of thing being added in a backwards compatible way later.
> 
> I don't really see anything that is a blocker.
> 
> What do you think?

I do not fully understand the MPAM counter feature. It almost sounds like
every counter could be configured independently with the expectation to
configure and assign each counter independently to a domain. As I understand
these capabilities match AMD's ABMC feature, but the planned implementation
to support ABMC first configures events per-domain and then assign these
events to counters. hmmm ... but in your example a file like
"mbm_counter0_bytes_type" is global. Could you please elaborate how in
your example writing a single letter to that file will be interpreted?


Reinette

[1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@xxxxxxxxx/
[2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@xxxxxxxxxxxxxx/




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux