Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dave/Reinette,

On 2/14/2025 12:26 AM, Reinette Chatre wrote:
Hi Dave,

On 2/13/25 9:37 AM, Dave Martin wrote:
Hi Reinette,

On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
Hi Dave,

On 2/12/25 9:46 AM, Dave Martin wrote:
Hi there,

On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:

This series adds the support for Assignable Bandwidth Monitoring Counters
(ABMC). It is also called QoS RMID Pinning feature

Series is written such that it is easier to support other assignable
features supported from different vendors.

The feature details are documented in the  APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'

[...]

b. Check how many ABMC counters are available.

	# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
	32

Is this file needed?

With MPAM, it is more difficult to promise that the same number of
counters will be available everywhere.

Rather than lie, or report a "safe" value here that may waste some
counters, can we just allow the number of counters to be be discovered
per domain via available_mbm_cntrs?

This sounds reasonable to me. I think us having trouble with the
user documentation of this file so late in development should also have been
a sign to rethink its value.

For a user to discover the number of counters supported via available_mbm_cntrs
would require the file's contents to be captured right after mount. Since we've
had scenarios where new userspace needs to discover an up-and-running system's
configuration this may not be possible. I thus wonder instead of removing
num_mbm_cntrs, it could be modified to return the per-domain supported counters
instead of a single value?

Is it actually useful to be able to discover the number of counters
that exist?  A counter that exists but is not available cannot be used,
so perhaps it is not useful to know about it in the first place.

An alternative perspective of what "available" means is "how many counters
could I possibly get to do this new monitoring task". A user may be willing
to re-assign counters if the new monitoring task is important. Knowing
how many counters are already free and available for assignment would be
easy from available_mbm_cntrs but to get an idea of how many counters
could be re-assigned to help out with the new task would require
some intricate parsing of mbm_assign_control.


But if we keep this file but make it report the number of counters for
each domain (similarly to mbm_available_cntrs), then I think the MPAM
driver should be able to work with that.

num_closids and num_rmids are already problematic for MPAM, so it would
be good to avoid any more parameters of this sort from being reported
to userspace unless there is a clear understanding of why they are
needed.

Yes. Appreciate your help in identifying what could be problematic for MPAM.

For clarity: this is a background issue, mostly orthogonal to this
series.

If this series is merged as-is, with a global per-resource
num_mbm_cntrs property, then this not really worse than the current
situation -- it's just a bit annoying from the MPAM perspective.


In a nutshell, the num_closids / num_rmids parameters seem to expose
RDT-specific hardware semantics to userspace, implying a specific
allocation model for control group and monitoring group identifiers.

The guarantees that userspace is entitled to asssume when resctrl
reports particular values do not seem to be well described and are hard
to map onto the nearest-equivalent MPAM implementation.  A combination
of control and monitoring groups that can be created on x86 may not be
creatable on MPAM, even when the number of supportable control and
monitoring partitions is the same.

I understand. This interface was created almost a decade ago. It would have been
wonderful if the user interface could have been created with a clear vision
of all the use cases it would end up needing to support. I am trying to be
very careful with this new user interface as I try to consider all the things I
learned while working on resctrl. All help get this new interface right is
greatly appreciated.

Since your specifically mention issues that MPAM has with num_rmids, please
note that we have been trying (see [1], but maybe start reading thread at [2])
to find ways to make this work with MPAM but no word from MPAM side.
I see that you were not cc'd on the discussion so this is not a criticism of
you personally but I would like to highlight that we do try to make things
work well for MPAM but so far this work seems ignored, yet critisized
for not being done. I expect the more use cases are thrown at an interface
as it is developed the better it would get and I would gladly work with MPAM
folks to improve things.

Even with the ABMC series, we may still be constrained on what we can
report for num_rmids: we can't know in advance whether or not the user
is going to use mbm_cntr_assign mode -- if not, we can't promise to
create more monitoring groups than the number of counters in the
hardware.

It is the architecture that decides which modes are supported and
which is default.

It seems natural for the counts reported by "available_mbm_cntrs" to
change dynamically when the ABMC assignment mode is changed, but I
think userspace are likely to expect the global "num_rmids" parameters
to be fixed for the lifetime of the resctrl mount (and possibly fixed
for all time on a given hardware platform -- at least, modulo CDP).


I think it might be possible to tighten up the docmentation of
num_closids in particular in a way that doesn't conflict with x86 and
may make it easier for MPAM to fit in with, but that feels like a
separate conversation.

None of this should be considered a blocker for this series, either way.


Reporting number of counters per monitoring domain is a more natural
fit for MPAM, as below:

c. Check how many ABMC counters are available in each domain.

	# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
	0=30;1=30

For MPAM, this seems supportable.  Each monitoring domain will have
some counters, and a well-defined number of them will be available for
allocation at any one time.

[...]

e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control

[...]

        Flags can be one of the following:

         t  MBM total event is enabled.
         l  MBM local event is enabled.
         tl Both total and local MBM events are enabled.
         _  None of the MBM events are enabled

	Examples:

[...]

I think that this basically works for MPAM.

The local/total distinction doesn't map in a consistent way onto MPAM,
but this problem is not specific to ABMC.  It feels sensible for ABMC
to be built around the same concepts that resctrl already has elsewhere
in the interface.  MPAM will do its best to fit (as already).

Regarding Peter's use case of assiging multiple counters to a
monitoring group [1], I feel that it's probably good enough to make
sure that the ABMC interface can be extended in future in a backwards
compatible way so as to support this, without trying to support it
immediately.

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@xxxxxxxxxxxxxx/


I do not think that resctrl's current support of the mbm_total_bytes and
mbm_local_bytes should be considered as the "only" two available "slots"
into which all possible events should be forced into. "mon_features" exists
to guide user space to which events are supported and as I see it new events
can be listed here to inform user space of their availability, with their
associated event files available in the resource groups.

That's fair.  I wasn't currently sure how (or if) the set of countable
events was expected to grow / evolve via this route.

Either way, I think this confirms that there is at least one viable way
to enable more counters for a single control group, on top of this
series.

(If there is more than one way, that seems fine?)


For example, if we added new generic "letters" -- say, "0" to "9",
combined with new counter files in resctrlfs, that feels like a
possible approach.  ABMC (as in this series) should just reject such
such assignments, and the new counter files wouldn't exist.

Availability of this feature could also be reported as a distinct mode
in mbm_assign_mode, say "mbm_cntr_generic", or whatever.


A _sketch_ of this follows.  This is NOT a proposal -- the key
question is whether we are confident that we can extend the interface
in this way in the future without breaking anything.

If "yes", then the ABMC interface (as proposed by this series) works as
a foundation to build on.

--8<--

[artists's impression]

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  	mbm_cntr_generic
  	[mbm_cntr_assign]
  	default

# echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
# echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type

...

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes

etc.


It is not clear to me what additional features such an interface enables. It
also looks like user space will need to track and manage counter IDs?

My idea was that for these generic counters, new files could be exposed
to configure what they actually count (the ..._type files shown above;
or possibly via the ..._config files that already exist).

The "IDs" were inteded as abstract; the number only relates the
assignments in mbm_assign_control to the files created elsewhere.  This
wouldn't be related to IDs assigned by the hardware.

I see. Yes, this sounds related to and a generalization of the AMD
configurable event feature.


If there are multiple resctrl users then using numeric IDs might be
problematic; though if we go eventually in the direction of making
resctrlfs multi-mountable then each mount could have its own namespace.

I am not aware of "multi-mountable" direction.


Allowing counters to be named and configured with a mkdir()-style
interface might be possible too; that might make it easier for users to
coexist within a single resctrl mount (if we think that's important
enough).

It sounds to me as though the issue starts with your statement
"The local/total distinction doesn't map in a consistent way onto MPAM". To
address this I expect that an MPAM system will not support nor list
mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
it would list the events that are appropriate to the system? Trying to match
with what Peter said [1] in the message you refer to, this may be possible:

# cat /sys/fs/resctrl/info/L3_MON/mon_features
mbm_local_read_bytes
mbm_local_write_bytes
mbm_local_bytes

(*) I am including mbm_local_bytes since it could be an event that can be software
defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
counted.

I see the support for MPAM events distinct from the support of assignable counters.
Once the MPAM events are sorted, I think that they can be assigned with existing interface.
Please help me understand if you see it differently.
	
Doing so would need to come up with alphabetical letters for these events,
which seems to be needed for your proposal also? If we use possible flags of:

mbm_local_read_bytes a
mbm_local_write_bytes b

Then mbm_assign_control can be used as:
# echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
<value>
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
<sum of mbm_local_read_bytes and mbm_local_write_bytes>

One issue would be when resctrl needs to support more than 26 events (no more flags available),
assuming that upper case would be used for "shared" counters (unless this interface is defined
differently and only few uppercase letters used for it). Would this be too low of a limit?

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@xxxxxxxxxxxxxx/

That approach would also work, where an MPAM system has events are not
a reasonable approximation of the generic "total" or "local".

For now we would probably stick with "total" and "local" anyway though,
because the MPAM architecture doesn't natively allow the mapping onto
the memory system topology to be discovered, and the information in
ACPI / device tree is insufficient to tell us everything we'd need to
know.  But I guess what counts as "local" in particular will be quite
hardware and topology dependent even on x86, so perhaps we shouldn't
worry about having the behaviour match exactly (?)

Regarding the code letters, my idea was that the event type might be
configured by a separate file, instead of in mbm_assign_control
directly, in which case running out of letters wouldn't be a problem.

This work started with individual files for counters but the issue was
raised that this will require a large number of filesystem calls when, for
example, a user wants to move a group of counters associated with the events
of one set of monitoring groups to another set of monitoring groups. This
is for the use case where there are a significant number of monitor groups
for which there are not sufficient counters. With mbm_assign_control this
can be done in a single write and such a monitoring transition can thus
be accomplished more efficiently.


Alternatively, if we want to be able to expand beyond single letters,
could we reserve one or more characters for extension purposes?

If braces are forbidden by the syntax today, could we add support for
something like the following later on, without breaking anything?

# echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control


Thank you for the suggestion. I think we may need something like this.
Babu, what do you think?

I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
It is a random text?

In his example from
https://lore.kernel.org/lkml/Z643WdXYARTADSBy@xxxxxxxxxxxxxxx/
--------------------------------------------------------------
The numbers are not supposed to have an hardware significance.

	'//0=6'

just "means assign some unused counter for domain 0, and create files
in resctrl so I can configure and read it".

The "6" is really just a tag for labelling the resulting resctrl
file names so that the user can tell them apart.  It's not supposed
to imply any specific hardware counter or event.
------------------------------------------------------------------

It seems that 'foo' and 'bar' are tags used to create files in /sys/fs/resctrl/info/L3_MON/.

Given that, it looks like we're discussing entirely different things.



For now, my main concern would be whether this series prevents that
sort of thing being added in a backwards compatible way later.

I don't really see anything that is a blocker.

What do you think?

I do not fully understand the MPAM counter feature. It almost sounds like
every counter could be configured independently with the expectation to
configure and assign each counter independently to a domain. As I understand
these capabilities match AMD's ABMC feature, but the planned implementation
to support ABMC first configures events per-domain and then assign these
events to counters. hmmm ... but in your example a file like
"mbm_counter0_bytes_type" is global. Could you please elaborate how in
your example writing a single letter to that file will be interpreted?


Reinette

[1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@xxxxxxxxx/
[2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@xxxxxxxxxxxxxx/






[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux