Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>    number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>    nodes.
> 3) The RMID renumbering operates when using the value from the
>    IA32_PQR_ASSOC MSR to count accesses by a task.
>    When reading an RMID counter, adjust from the logical RMID to the physical
>    RMID value for the SNC node that it wishes to read and load the
>    adjusted value into the IA32_QM_EVTSEL MSR.
> 4) Divide the L3 cache between the SNC nodes. Divide the value
>    reported in the resctrl "size" file by the number of SNC
>    nodes because the effective amount of cache that can be allocated
>    is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
>    because the monitoring is being done per SNC node, while the
>    bandwidth allocation is still done at the L3 cache scope.
>    Trying to use this feedback loop might result in contradictory
>    changes to the throttling level coming from each of the SNC
>    node bandwidth measurements.
>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     |  6 ++++++
>  arch/x86/kernel/cpu/resctrl/monitor.c  | 16 +++++++++++++---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  5 +++--
>  4 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c6051bc70e96..d9c6dcf30922 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
>  extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;

I feel this can be part of rdt_resource instead of global.

> +
>  enum resctrl_res_level {
>  	RDT_RESOURCE_L3_MON,
>  	RDT_RESOURCE_L3,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b741cbf61843..dc886d2c9a33 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
>   */
>  bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache.
> + * Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
>  static void
>  mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
>  		struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 080cad0d7288..357919bbadbe 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
>  static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  {
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

Should this be RDT_RESOURCE_L3_MON?

> +	int cpu = smp_processor_id();
> +	int rmid_offset = 0;
>  	u64 msr_val;
>
> +	/*
> +	 * When SNC mode is on, need to compute the offset to read the
> +	 * physical RMID counter for the node to which this CPU belongs.
> +	 */
> +	if (snc_nodes_per_l3_cache > 1)
> +		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

Not sure if you have tested this. r->num_rmid is only initialized for the
resource RDT_RESOURCE_L3_MON. For other resources it is always 0, so
rmid_offset would always be 0 here.

> +
>  	/*
>  	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>  	 * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>  	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>  	 * are error bits.
>  	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
>  	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
>  	if (msr_val & RMID_VAL_ERROR)
> @@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>  	int ret;
>
>  	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>  	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
>  	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 770f2bf98462..e639069f871a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
>  		}
>  	}
>
> -	return size;
> +	return size / snc_nodes_per_l3_cache;
>  }
>
>  /*
> @@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
>  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
>  	return (is_mbm_local_enabled() &&
> -		r->alloc_capable && is_mba_linear());
> +		r->alloc_capable && is_mba_linear() &&
> +		snc_nodes_per_l3_cache == 1);
>  }
>
>  /*

-- 
Thanks
Babu Moger
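
To make the num_rmid concern in __rmid_read() concrete, here is a minimal
userspace sketch of the offset arithmetic. It is not kernel code. The
function name snc_physical_rmid() and its parameters are hypothetical:
node_id stands in for cpu_to_node(cpu), nodes_per_l3 for
snc_nodes_per_l3_cache, and rmids_per_node for r->num_rmid of whichever
resource was looked up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the logical-to-physical RMID adjustment.
 * Assumes the renumbered mode, where each SNC node's RMIDs start at
 * zero and the physical ranges of the nodes are laid out back to back.
 */
static uint32_t snc_physical_rmid(uint32_t logical_rmid, unsigned int node_id,
				  unsigned int nodes_per_l3,
				  uint32_t rmids_per_node)
{
	assert(nodes_per_l3 > 0);

	/* SNC off: logical and physical RMIDs are the same. */
	if (nodes_per_l3 == 1)
		return logical_rmid;

	/* SNC on: skip over the RMID ranges of the lower-numbered nodes. */
	return logical_rmid + (node_id % nodes_per_l3) * rmids_per_node;
}
```

With two SNC nodes and an example value of 128 RMIDs per node,
snc_physical_rmid(5, 1, 2, 128) gives 133. If rmids_per_node is 0, as it
would be when taken from a resource whose num_rmid was never initialized,
the same call gives 5, so a CPU on node 1 would end up reading node 0's
counter.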