[AMD Official Use Only - General] Hi Reinette, > -----Original Message----- > From: Reinette Chatre <reinette.chatre@xxxxxxxxx> > Sent: Thursday, December 15, 2022 12:25 PM > To: Moger, Babu <Babu.Moger@xxxxxxx>; corbet@xxxxxxx; > tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; bp@xxxxxxxxx > Cc: fenghua.yu@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; > hpa@xxxxxxxxx; paulmck@xxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; > quic_neeraju@xxxxxxxxxxx; rdunlap@xxxxxxxxxxxxx; > damien.lemoal@xxxxxxxxxxxxxxxxxx; songmuchun@xxxxxxxxxxxxx; > peterz@xxxxxxxxxxxxx; jpoimboe@xxxxxxxxxx; pbonzini@xxxxxxxxxx; > chang.seok.bae@xxxxxxxxx; pawan.kumar.gupta@xxxxxxxxxxxxxxx; > jmattson@xxxxxxxxxx; daniel.sneddon@xxxxxxxxxxxxxxx; Das1, Sandipan > <Sandipan.Das@xxxxxxx>; tony.luck@xxxxxxxxx; james.morse@xxxxxxx; > linux-doc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > bagasdotme@xxxxxxxxx; eranian@xxxxxxxxxx; christophe.leroy@xxxxxxxxxx; > jarkko@xxxxxxxxxx; adrian.hunter@xxxxxxxxx; quic_jiles@xxxxxxxxxxx; > peternewman@xxxxxxxxxx > Subject: Re: [PATCH v9 10/13] x86/resctrl: Add sysfs interface to write > mbm_total_bytes_config > > Hi Babu, > > On 12/1/2022 7:37 AM, Babu Moger wrote: > > The current event configuration for mbm_total_bytes can be changed by > > the user by writing to the file > > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config. > > Please drop "current" from above > > > > > The event configuration settings are domain specific and will affect > > all the CPUs in the domain. > > please drop "will" > > > > > Following are the types of events supported: > > > > ==== > =========================================================== > > Bits Description > > ==== > =========================================================== > > 6 Dirty Victims from the QOS domain to all types of memory > > 5 Reads to slow memory in the non-local NUMA domain > > 4 Reads to slow memory in the local NUMA domain > > 3 Non-temporal writes to non-local NUMA domain > > 2 Non-temporal writes to local NUMA domain > > 1 Reads to memory in the non-local NUMA domain > > 0 Reads to memory in the local NUMA domain > > ==== > =========================================================== > > > > For example: > > To change the mbm_total_bytes to count only reads on domain 0, the > > bits 0, 1, 4 and 5 needs to be set, which is 110011b (in hex 0x33). > > Run the command. > > $echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config > > > > To change the mbm_total_bytes to count all the slow memory reads on > > domain 1, the bits 4 and 5 needs to be set which is 110000b (in hex 0x30). > > Run the command. > > $echo 1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config > > > > Signed-off-by: Babu Moger <babu.moger@xxxxxxx> > > --- > > arch/x86/kernel/cpu/resctrl/monitor.c | 13 +++ > > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 127 > ++++++++++++++++++++++++++++++++ > > include/linux/resctrl.h | 10 +++ > > 3 files changed, 149 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c > > b/arch/x86/kernel/cpu/resctrl/monitor.c > > index 7c8a3a745041..b265856835de 100644 > > --- a/arch/x86/kernel/cpu/resctrl/monitor.c > > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c > > @@ -176,6 +176,19 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, > struct rdt_domain *d, > > memset(am, 0, sizeof(*am)); > > } > > > > +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct > > +rdt_domain *d) { > > + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); > > + > > + if (is_mbm_total_enabled()) > > + memset(hw_dom->arch_mbm_total, 0, > > + sizeof(*hw_dom->arch_mbm_total) * r->num_rmid); > > + > > + if (is_mbm_local_enabled()) > > + memset(hw_dom->arch_mbm_local, 0, > > + sizeof(*hw_dom->arch_mbm_local) * r->num_rmid); } > > + > > We learned a lot more about this area after Peter's discovery: > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kern > el.org%2Flkml%2F20221207112924.3602960-1- > peternewman%40google.com%2F&data=05%7C01%7Cbabu.moger%40am > d.com%7Cc9ca23a7b3c643d2385508dadec9f83d%7C3dd8961fe4884e608e11a8 > 2d994e183d%7C0%7C0%7C638067256273943013%7CUnknown%7CTWFpbGZsb > 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0% > 3D%7C3000%7C%7C%7C&sdata=Pmvi2L4L727GqALGaeEO0MbpcnuRcqKc > opNNXuu%2BbFM%3D&reserved=0 > > Since this is a new generic function it should be clear in which scenarios it is > valid. > Could you please add a function comment that warns future developers about > consequences if a new usage is considered? Something like: > > /* > * Assumes that hardware counters are also reset and thus that there is no > need > * to record initial non-zero counts. > */ > > > static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int > > width) { > > u64 shift = 64 - width, chunks; > > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c > > b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > > index 580f3cce19e2..8a22a652a6e8 100644 > > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c > > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c > > @@ -1517,6 +1517,130 @@ static int mbm_local_bytes_config_show(struct > kernfs_open_file *of, > > return 0; > > } > > > > +static void mon_event_config_write(void *info) { > > + struct mon_config_info *mon_info = info; > > + u32 index; > > + > > index does not need to be u32 ... mon_event_config_index_get() returns > "unsigned int" > and wrmsr expects "unsigned int", it can also just be "unsigned int". > > > > + index = mon_event_config_index_get(mon_info->evtid); > > + if (index == INVALID_CONFIG_INDEX) { > > + pr_warn_once("Invalid event id %d\n", mon_info->evtid); > > + return; > > + } > > + wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); > } > > + > > +static int mbm_config_write_domain(struct rdt_resource *r, > > + struct rdt_domain *d, u32 evtid, u32 val) { > > + struct mon_config_info mon_info = {0}; > > + int ret = 0; > > + > > + /* mon_config cannot be more than the supported set of events */ > > + if (val > MAX_EVT_CONFIG_BITS) { > > + rdt_last_cmd_puts("Invalid event configuration\n"); > > + return -EINVAL; > > + } > > + > > + /* > > + * Read the current config value first. If both are the same then > > + * no need to write it again. > > + */ > > + mon_info.evtid = evtid; > > + mondata_config_read(d, &mon_info); > > + if (mon_info.mon_config == val) > > + goto out; > > + > > + mon_info.mon_config = val; > > + > > + /* > > + * Update MSR_IA32_EVT_CFG_BASE MSRs on all the CPUs in the > > + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE > > + * are scoped at the domain level. Writing any of these MSRs > > + * on one CPU is supposed to be observed by all CPUs in the > > + * domain. However, the hardware team recommends to update > > + * these MSRs on all the CPUs in the domain. > > + */ > > + on_each_cpu_mask(&d->cpu_mask, mon_event_config_write, > &mon_info, > > +1); Forgot about this. This snippet is going to change. I have tested and works fine. How about this? /* * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE * are scoped at the domain level. Writing any of these MSRs * on one CPU is supposed to be observed by all CPUs in the domain. */ smp_call_function_any(&d->cpu_mask, mon_event_config_write, &mon_info, 1); /* > + * Update MSR_IA32_EVT_CFG_BASE MSRs on all the CPUs in the > + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE > + * are scoped at the domain level. Writing any of these MSRs > + * on one CPU is supposed to be observed by all CPUs in the > + * domain. However, the hardware team recommends to update > + * these MSRs on all the CPUs in the domain. > + */ > + on_each_cpu_mask(&d->cpu_mask, mon_event_config_write, &mon_info, 1); > > + > > + /* > > + * When an Event Configuration is changed, the bandwidth counters > > + * for all RMIDs and Events will be cleared by the hardware. The > > + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for > > + * every RMID on the next read to any event for every RMID. > > + * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) > > + * cleared while it is tracked by the hardware. Clear the > > + * mbm_local and mbm_total counts for all the RMIDs. > > + */ > > + resctrl_arch_reset_rmid_all(r, d); > > If I understand correctly the expectation is that when user space read counters > (via mon_data files) right after the configuration was changed then this read > will return "Unavailable" and then the next read will return data. > > If this is the case then I think a snippet about this user experience would be > helpful to add to the documentation. > > Have you considered doing a preemptive read on the RMIDs that are in use to > avoid users encountering "Unavailable"? I assume doing so on a busy system > could potentially involve hundreds of register reads/writes. > > > + > > +out: > > + return ret; > > +} > > + > > +static int mon_config_write(struct rdt_resource *r, char *tok, u32 > > +evtid) { > > + char *dom_str = NULL, *id_str; > > + unsigned long dom_id, val; > > + struct rdt_domain *d; > > + int ret = 0; > > + > > +next: > > + if (!tok || tok[0] == '\0') > > + return 0; > > + > > + /* Start processing the strings for each domain */ > > + dom_str = strim(strsep(&tok, ";")); > > + id_str = strsep(&dom_str, "="); > > + > > + if (!id_str || kstrtoul(id_str, 10, &dom_id)) { > > + rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); > > + return -EINVAL; > > + } > > + > > + if (!dom_str || kstrtoul(dom_str, 16, &val)) { > > + rdt_last_cmd_puts("Non-numeric event configuration > value\n"); > > + return -EINVAL; > > + } > > + > > + list_for_each_entry(d, &r->domains, list) { > > + if (d->id == dom_id) { > > + ret = mbm_config_write_domain(r, d, evtid, val); > > + if (ret) > > + return -EINVAL; > > + goto next; > > + } > > + } > > + > > + return -EINVAL; > > +} > > + > > +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, > > + char *buf, size_t nbytes, > > + loff_t off) > > +{ > > + struct rdt_resource *r = of->kn->parent->priv; > > + int ret; > > + > > + /* Valid input requires a trailing newline */ > > + if (nbytes == 0 || buf[nbytes - 1] != '\n') > > + return -EINVAL; > > + > > + cpus_read_lock(); > > + mutex_lock(&rdtgroup_mutex); > > + > > + rdt_last_cmd_clear(); > > + > > + buf[nbytes - 1] = '\0'; > > + > > + ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); > > + > > + mutex_unlock(&rdtgroup_mutex); > > + cpus_read_unlock(); > > + > > + return ret ?: nbytes; > > +} > > + > > /* rdtgroup information files for one cache resource. */ static > > struct rftype res_common_files[] = { > > { > > @@ -1617,9 +1741,10 @@ static struct rftype res_common_files[] = { > > }, > > { > > .name = "mbm_total_bytes_config", > > - .mode = 0444, > > + .mode = 0644, > > .kf_ops = &rdtgroup_kf_single_ops, > > .seq_show = mbm_total_bytes_config_show, > > + .write = mbm_total_bytes_config_write, > > }, > > { > > .name = "mbm_local_bytes_config", > > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index > > 0cee154abc9f..e4dc65892446 100644 > > --- a/include/linux/resctrl.h > > +++ b/include/linux/resctrl.h > > @@ -250,6 +250,16 @@ int resctrl_arch_rmid_read(struct rdt_resource > > *r, struct rdt_domain *d, void resctrl_arch_reset_rmid(struct rdt_resource > *r, struct rdt_domain *d, > > u32 rmid, enum resctrl_event_id eventid); > > > > +/** > > + * resctrl_arch_reset_rmid_all() - Reset any private state associated with > > + * all the rmids. > > It could be more explicit: > "Reset all private state associated with all rmids and eventids." > > > + * @r: The domain's resource. > > + * @d: The rmid's domain. > > This copy&paste needs some changes to match this new utility. > How about: > @r: The resctrl resource. > @d: The domain for which all architectural counter state will be cleared. > > I think it can be improved more but the above could be a start (please do not > copy verbatim but ensure style is correct.) > > Keep in mind that this utility does not clear the non-architectural counter state. > This does not apply to AMD since that state is used by the software controller, > but it needs to be kept in mind if another usage for this utility arises. > > > + * > > + * This can be called from any CPU. > > + */ > > +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct > > +rdt_domain *d); > > + > > extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned > > int resctrl_rmid_realloc_limit; > > > > > > > > The above hunk fails the "no spaces before tabs" checkpatch check. > > Reinette