Re: [PATCH v9 14/26] x86/resctrl: Introduce interface to display number of free counters

Hi Peter,

On 11/28/2024 5:10 AM, Peter Newman wrote:
Hi Babu, Reinette,

On Wed, Nov 27, 2024 at 8:05 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:

Hi Babu,

On 11/27/24 6:57 AM, Moger, Babu wrote:
Hi Reinette,

On 11/26/2024 5:56 PM, Reinette Chatre wrote:
Hi Babu,

On 11/26/24 3:31 PM, Moger, Babu wrote:
On 11/25/2024 1:00 PM, Reinette Chatre wrote:
On 11/22/24 3:36 PM, Moger, Babu wrote:
On 11/21/2024 3:12 PM, Reinette Chatre wrote:
On 11/19/24 11:20 AM, Moger, Babu wrote:
On 11/15/24 18:31, Reinette Chatre wrote:
On 10/29/24 4:21 PM, Babu Moger wrote:
Provide the interface to display the number of free monitoring counters
available for assignment in each domain when mbm_cntr_assign is supported.

Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
---
v9: New patch.
---
     Documentation/arch/x86/resctrl.rst     |  4 ++++
     arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
     arch/x86/kernel/cpu/resctrl/rdtgroup.c | 33 ++++++++++++++++++++++++++
     3 files changed, 38 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 2f3a86278e84..2bc58d974934 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -302,6 +302,10 @@ with the following files:
 	memory bandwidth tracking to a single memory bandwidth event per
 	monitoring group.
+
+"available_mbm_cntrs":
+	The number of free monitoring counters available assignment in each domain

"The number of free monitoring counters available assignment" -> "The number of monitoring
counters available for assignment"?

(not taking into account how text may change after addressing Peter's feedback)

How about this?

"The number of monitoring counters available for assignment in each domain
when the architecture supports mbm_cntr_assign mode. There are a total of
"num_mbm_cntrs" counters are available for assignment. Counters can be
assigned or unassigned individually in each domain. A counter is available
for new assignment if it is unassigned in all domains."

Please consider the context of this paragraph. It follows right after the description
of "num_mbm_cntrs" that states "Up to two counters can be assigned per monitoring group".
I think it is confusing to follow that with a paragraph that states "Counters can be
assigned or unassigned individually in each domain." I wonder if it may be helpful to
use a different term ... for example a counter is *assigned* to an event of a monitoring
group but this assignment may be to specified (not yet supported) or all (this work) domains while
it is only *programmed*/*activated* to specified domains. Of course, all of this documentation
needs to remain coherent if future work decides to indeed support per-domain assignment.


I'm a little lost here. Please help me.

I think this highlights the uncertainty this interface brings. How do you expect users
to use this interface? At this time I think this interface can create a lot of confusion.
For example, consider a hypothetical system with three domains and four counters that
has the following state per mbm_assign_control:

//0=tl;1=_;2=l #default group uses counters 0 and 1 to monitor total and local MBM
/m1/0=_;1=t;2=t #monitor group m1 uses counter 2, just for total MBM
/m2/0=l;1=_;2=l #monitor group m2 uses counter 3, just for local MBM
/m3/0=_;1=_;2=_

Since, in this system there are only four counters available, and
they have all been assigned, then there are no new counters available for
assignment.

If I understand correctly, available_mbm_cntrs will read:
0=1;1=3;2=1

Yes. Exactly. This causes confusion for the user.

How is a user to interpret the above numbers? It does not reflect
that no counter can be assigned to m3, instead it reflects which of the
already assigned counters still need to be activated on domains.
If, for example, a user is expected to use this file to know how
many counters can still be assigned, should it not reflect the actual
available counters? In the above scenario it will then be:
0=0;1=0;2=0

We can also just print
#cat available_mbm_cntrs
0

The domain-specific information is not important here.
That was my original idea. We can go back to that definition. That is clearer to the user.
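
A rough sketch of what that show handler could look like (num_mbm_cntrs
and mbm_cntr_assign_map are stand-in names for the global counter count
and allocation bitmap, not the actual identifiers in this series):

static int available_mbm_cntrs_show(struct kernfs_open_file *of,
				    struct seq_file *s, void *v)
{
	mutex_lock(&rdtgroup_mutex);
	/* Counters set in the global bitmap are assigned somewhere */
	seq_printf(s, "%u\n", num_mbm_cntrs -
		   bitmap_weight(mbm_cntr_assign_map, num_mbm_cntrs));
	mutex_unlock(&rdtgroup_mutex);
	return 0;
}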

Tony's response [1] still applies.

I believe Tony's suggestion [2] considered that the available counters will be the
same for every domain for this implementation. That is why my example noted:
"0=0;1=0;2=0"

Yes. We can keep it like this.


The confusion surrounding the global allocator seems to be prevalent ([3], [4]) as folks
familiar with resctrl attempt to digest the work. The struggle to make this documentation clear
makes me more concerned how this feature will be perceived by users who are not as familiar with
resctrl internals. I think that it may be worth it to take a moment and investigate what it will take
to implement a per-domain counter allocator. The hardware supports it and I suspect that the upfront
work to do the enabling will make it easier for users to adopt and understand the feature.

What do you think?

It adds more complexity for sure.

I do see a difference in data structures used but the additional complexity is not
obvious to me. It seems like there will be one fewer data structure, the
global bitmap, and I think that will actually bring with it some simplification since
there is no longer the need to coordinate between the per-domain and global counters,
for example the logic that only frees a global counter if it is no longer used by a domain.

This may also simplify the update of the monitor event config (BMEC) since it can be
done directly on counters of the domain instead of needing to go back and forth between
global and per-domain counters.


1. Each group needs to remember counter ids in each domain for each event.
    For example:
    Resctrl group mon1
     Total event
     dom 0 cntr_id 1,
     dom 1 cntr_id 10
     dom 2 cntr_id 11

    Local event
     dom 0 cntr_id 2,
     dom 1 cntr_id 15
     dom 2 cntr_id 10

Indeed. The challenge here is that domains may come and go so it cannot be a simple
static array. As an alternative it can be an xarray indexed by the domain ID with
pointers to a struct like below to contain the counters associated with the monitor
group:
         struct cntr_id {
                 u32     mbm_total;      /* counter id assigned for the total event */
                 u32     mbm_local;      /* counter id assigned for the local event */
         };
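
A rough sketch of how such an xarray could be managed (the helper names
are made up purely for illustration):

#include <linux/slab.h>
#include <linux/xarray.h>

/* Per monitor group: domain ID -> struct cntr_id above */
static int grp_set_cntr_ids(struct xarray *cntr_ids,
			    unsigned long dom_id, u32 total, u32 local)
{
	struct cntr_id *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (!c)
		return -ENOMEM;
	c->mbm_total = total;
	c->mbm_local = local;
	/* xa_store() returns the old entry or an xa_err() pointer */
	return xa_err(xa_store(cntr_ids, dom_id, c, GFP_KERNEL));
}

/* When a domain goes away, its entry simply disappears with it: */
static void grp_clear_cntr_ids(struct xarray *cntr_ids,
			       unsigned long dom_id)
{
	kfree(xa_erase(cntr_ids, dom_id));
}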


Thinking more about how this array needs to be managed made me wonder how the
current implementation deals with domains that come and go. I do not think
this is currently handled. For example, if a new domain comes online and
monitoring groups had counters dynamically assigned, then these counters are
not configured to the newly online domain.

I am trying to understand the details of your approach here.

In my prototype, I allocated a counter id-indexed array to each
monitoring domain structure for tracking the counter allocations,
because the hardware counters are all domain-scoped. That way the
tracking data goes away when the hardware does.

I was focused on allowing all pending counter updates to a domain
resulting from a single mbm_assign_control write to be batched and
processed in a single IPI, so I structured the counter tracker
something like this:

Not sure what you meant here. How are you batching two IPIs for two domains?

#echo "//0=t;1=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

This is still a single write. Two IPIs are sent separately, one for each domain.

Are you doing something different?


struct resctrl_monitor_cfg {
     int closid;
     int rmid;
     int evtid;
     bool dirty;
};

This mirrors the info needed in whatever register configures the
counter, plus a dirty flag to skip over the ones that don't need to be
updated.
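
Roughly, the flush I have in mind looks like this (assuming the
per-domain array is reachable as d->mon_cfg, and with
resctrl_arch_config_cntr() as a made-up stand-in for whatever routine
programs a single counter):

static void resctrl_flush_cntr_cfg(void *info)
{
	struct rdt_mon_domain *d = info;
	int i;

	for (i = 0; i < num_mbm_cntrs; i++) {
		struct resctrl_monitor_cfg *cfg = &d->mon_cfg[i];

		if (!cfg->dirty)
			continue;
		/* Program one hardware counter in this domain */
		resctrl_arch_config_cntr(d, i, cfg->closid, cfg->rmid,
					 cfg->evtid);
		cfg->dirty = false;
	}
}

static void resctrl_flush_domain(struct rdt_mon_domain *d)
{
	/* One IPI per touched domain for the whole control write */
	smp_call_function_any(&d->hdr.cpu_mask, resctrl_flush_cntr_cfg, d, 1);
}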

This is my understanding of your implementation:

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..9cebf065cc97 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -94,6 +94,13 @@ struct rdt_ctrl_domain {
        u32                             *mbps_val;
 };
 
+struct resctrl_monitor_cfg {
+       int     closid;
+       int     rmid;
+       int     evtid;
+       bool    dirty;
+};
+
 /**
  * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:               common header for different domain types
@@ -116,4 +123,6 @@ struct rdt_mon_domain {
        struct delayed_work             cqm_limbo;
        int                             mbm_work_cpu;
        int                             cqm_work_cpu;
+       /* Allocate num_mbm_cntrs entries in each domain */
+       struct resctrl_monitor_cfg      *mon_cfg;
 };


When a user requests assignment of the total event to the default group for domain 0, you search the rdt_mon_domain (dom 0) for an empty mon_cfg entry.

If there is an empty entry, use that entry for the assignment and update closid, rmid, evtid and set dirty = 1. We can get all of this information from the default group here.
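
Roughly like this (mbm_cntr_alloc() is a made-up name, and treating
evtid == 0 as "slot free" is an assumption):

static int mbm_cntr_alloc(struct rdt_mon_domain *d, int closid, int rmid,
			  int evtid)
{
	int i;

	for (i = 0; i < num_mbm_cntrs; i++) {
		struct resctrl_monitor_cfg *cfg = &d->mon_cfg[i];

		/* A slot with no event configured is free */
		if (cfg->evtid)
			continue;
		cfg->closid = closid;
		cfg->rmid = rmid;
		cfg->evtid = evtid;
		cfg->dirty = true;
		return i;	/* counter id within this domain */
	}
	return -ENOSPC;
}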

Does this make sense?


For the benefit of displaying mbm_assign_control, I put a pointer back
to any counter array entry allocated in the mbm_state struct only
because it's an existing structure that exists for every rmid-domain
combination.

A pointer in mbm_state may not be required here.

We are going to loop over the resctrl groups. We can search the rdt_mon_domain to see whether a specific closid, rmid, evtid is already assigned in that domain.
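
For example, a sketch of that lookup (mbm_cntr_find() is a made-up
name):

static int mbm_cntr_find(struct rdt_mon_domain *d, int closid, int rmid,
			 int evtid)
{
	int i;

	for (i = 0; i < num_mbm_cntrs; i++) {
		struct resctrl_monitor_cfg *cfg = &d->mon_cfg[i];

		if (cfg->evtid == evtid && cfg->closid == closid &&
		    cfg->rmid == rmid)
			return i;	/* already assigned in this domain */
	}
	return -ENOENT;	/* not assigned in this domain */
}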


I didn't need to change the rdtgroup structure.

Ok. That is good.





2. We should have a bitmap of "available counters" in each domain. We have
this already, but the allocation method changes.

Would allocation/free not be simpler with only the per-domain bitmap needing
to be consulted?

One implementation change I can think of is the dynamic assignment of counters
when a monitor group is created. Now a free counter needs to be found in each
domain. It can be discussed whether this should be an "all or nothing"
assignment, but the handling does not seem complex and would need to be
solved eventually anyway.
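
As a sketch of how simple the per-domain allocator could be (assuming a
hypothetical per-domain bitmap d->mbm_cntr_map of num_mbm_cntrs bits):

static int mbm_cntr_get(struct rdt_mon_domain *d)
{
	int cntr_id;

	cntr_id = find_first_zero_bit(d->mbm_cntr_map, num_mbm_cntrs);
	if (cntr_id >= num_mbm_cntrs)
		return -ENOSPC;

	set_bit(cntr_id, d->mbm_cntr_map);
	return cntr_id;
}

static void mbm_cntr_put(struct rdt_mon_domain *d, int cntr_id)
{
	clear_bit(cntr_id, d->mbm_cntr_map);
}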

3. Dynamic allocation/free of the counters

There could be more things that I can't think of right now; they might
come up when we start working on it.

It is doable. But is it worth adding this complexity? I am not sure.

Please elaborate where you see that this is too complex.


Peter mentioned earlier that he was not interested in domain-specific
assignments. He was only interested in an all-domain ("*") implementation.

Peter's most recent message indicates otherwise:
https://lore.kernel.org/all/CALPaoCgiHEaY_cDbCo=537JJ7mkYZDFFDs9heYvtQ80fXuuvWQ@xxxxxxxxxxxxxx/

For now, I'm focused on managing the domains locally whenever possible
to avoid all IPIs, as this gives me the least overhead.

I'm also prototyping the 'T' vs 't' approach that Reinette
suggested[1], as this may take a lot of performance pressure off the
mbm_assign_control interface, as most of the routine updates to
counter assignments would be automated.

-Peter

[1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@xxxxxxxxx/


--
- Babu Moger




