RE: [PATCH v9 10/13] x86/resctrl: Add sysfs interface to write mbm_total_bytes_config

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

Hi Reinette,

> -----Original Message-----
> From: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> Sent: Thursday, December 15, 2022 12:25 PM
> To: Moger, Babu <Babu.Moger@xxxxxxx>; corbet@xxxxxxx;
> tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; bp@xxxxxxxxx
> Cc: fenghua.yu@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> hpa@xxxxxxxxx; paulmck@xxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> quic_neeraju@xxxxxxxxxxx; rdunlap@xxxxxxxxxxxxx;
> damien.lemoal@xxxxxxxxxxxxxxxxxx; songmuchun@xxxxxxxxxxxxx;
> peterz@xxxxxxxxxxxxx; jpoimboe@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> chang.seok.bae@xxxxxxxxx; pawan.kumar.gupta@xxxxxxxxxxxxxxx;
> jmattson@xxxxxxxxxx; daniel.sneddon@xxxxxxxxxxxxxxx; Das1, Sandipan
> <Sandipan.Das@xxxxxxx>; tony.luck@xxxxxxxxx; james.morse@xxxxxxx;
> linux-doc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> bagasdotme@xxxxxxxxx; eranian@xxxxxxxxxx; christophe.leroy@xxxxxxxxxx;
> jarkko@xxxxxxxxxx; adrian.hunter@xxxxxxxxx; quic_jiles@xxxxxxxxxxx;
> peternewman@xxxxxxxxxx
> Subject: Re: [PATCH v9 10/13] x86/resctrl: Add sysfs interface to write
> mbm_total_bytes_config
> 
> Hi Babu,
> 
> On 12/1/2022 7:37 AM, Babu Moger wrote:
> > The current event configuration for mbm_total_bytes can be changed by
> > the user by writing to the file
> > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.
> 
> Please drop "current" from above
> 
> >
> > The event configuration settings are domain specific and will affect
> > all the CPUs in the domain.
> 
> please drop "will"
> 
> >
> > Following are the types of events supported:
> >
> > ====
> ===========================================================
> > Bits   Description
> > ====
> ===========================================================
> > 6      Dirty Victims from the QOS domain to all types of memory
> > 5      Reads to slow memory in the non-local NUMA domain
> > 4      Reads to slow memory in the local NUMA domain
> > 3      Non-temporal writes to non-local NUMA domain
> > 2      Non-temporal writes to local NUMA domain
> > 1      Reads to memory in the non-local NUMA domain
> > 0      Reads to memory in the local NUMA domain
> > ====
> ===========================================================
> >
> > For example:
> > To change the mbm_total_bytes to count only reads on domain 0, the
> > bits 0, 1, 4 and 5 needs to be set, which is 110011b (in hex 0x33).
> > Run the command.
> > 	$echo  0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> >
> > To change the mbm_total_bytes to count all the slow memory reads on
> > domain 1, the bits 4 and 5 needs to be set which is 110000b (in hex 0x30).
> > Run the command.
> > 	$echo  1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> >
> > Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
> > ---
> >  arch/x86/kernel/cpu/resctrl/monitor.c  |   13 +++
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  127
> ++++++++++++++++++++++++++++++++
> >  include/linux/resctrl.h                |   10 +++
> >  3 files changed, 149 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> > b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 7c8a3a745041..b265856835de 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -176,6 +176,19 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r,
> struct rdt_domain *d,
> >  		memset(am, 0, sizeof(*am));
> >  }
> >
> > +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct
> > +rdt_domain *d) {
> > +	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> > +
> > +	if (is_mbm_total_enabled())
> > +		memset(hw_dom->arch_mbm_total, 0,
> > +		       sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> > +
> > +	if (is_mbm_local_enabled())
> > +		memset(hw_dom->arch_mbm_local, 0,
> > +		       sizeof(*hw_dom->arch_mbm_local) * r->num_rmid); }
> > +
> 
> We learned a lot more about this area after Peter's discovery:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kern
> el.org%2Flkml%2F20221207112924.3602960-1-
> peternewman%40google.com%2F&amp;data=05%7C01%7Cbabu.moger%40am
> d.com%7Cc9ca23a7b3c643d2385508dadec9f83d%7C3dd8961fe4884e608e11a8
> 2d994e183d%7C0%7C0%7C638067256273943013%7CUnknown%7CTWFpbGZsb
> 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%
> 3D%7C3000%7C%7C%7C&amp;sdata=Pmvi2L4L727GqALGaeEO0MbpcnuRcqKc
> opNNXuu%2BbFM%3D&amp;reserved=0
> 
> Since this is a new generic function it should be clear in which scenarios it is
> valid.
> Could you please add a function comment that warns future developers about
> consequences if a new usage is considered? Something like:
> 
> /*
>  * Assumes that hardware counters are also reset and thus that there is no
> need
>  * to record initial non-zero counts.
>  */
> 
> >  static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int
> > width)  {
> >  	u64 shift = 64 - width, chunks;
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 580f3cce19e2..8a22a652a6e8 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1517,6 +1517,130 @@ static int mbm_local_bytes_config_show(struct
> kernfs_open_file *of,
> >  	return 0;
> >  }
> >
> > +static void mon_event_config_write(void *info) {
> > +	struct mon_config_info *mon_info = info;
> > +	u32 index;
> > +
> 
> index does not need to be u32 ... mon_event_config_index_get() returns
> "unsigned int"
> and wrmsr expects "unsigned int", it can also just be "unsigned int".
> 
> 
> > +	index = mon_event_config_index_get(mon_info->evtid);
> > +	if (index == INVALID_CONFIG_INDEX) {
> > +		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
> > +		return;
> > +	}
> > +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> }
> > +
> > +static int mbm_config_write_domain(struct rdt_resource *r,
> > +				   struct rdt_domain *d, u32 evtid, u32 val) {
> > +	struct mon_config_info mon_info = {0};
> > +	int ret = 0;
> > +
> > +	/* mon_config cannot be more than the supported set of events */
> > +	if (val > MAX_EVT_CONFIG_BITS) {
> > +		rdt_last_cmd_puts("Invalid event configuration\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * Read the current config value first. If both are the same then
> > +	 * no need to write it again.
> > +	 */
> > +	mon_info.evtid = evtid;
> > +	mondata_config_read(d, &mon_info);
> > +	if (mon_info.mon_config == val)
> > +		goto out;
> > +
> > +	mon_info.mon_config = val;
> > +
> > +	/*
> > +	 * Update MSR_IA32_EVT_CFG_BASE MSRs on all the CPUs in the
> > +	 * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE
> > +	 * are scoped at the domain level. Writing any of these MSRs
> > +	 * on one CPU is supposed to be observed by all CPUs in the
> > +	 * domain. However, the hardware team recommends to update
> > +	 * these MSRs on all the CPUs in the domain.
> > +	 */
> > +	on_each_cpu_mask(&d->cpu_mask, mon_event_config_write,
> &mon_info,
> > +1);

Forgot about this.  This snippet is going to change. I have tested and works fine.
How about this?

        /*
         * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the
         * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE
         * are scoped at the domain level. Writing any of these MSRs
         * on one CPU is supposed to be observed by all CPUs in the domain.
         */
        smp_call_function_any(&d->cpu_mask, mon_event_config_write, &mon_info, 1);


/*
> +	 * Update MSR_IA32_EVT_CFG_BASE MSRs on all the CPUs in the
> +	 * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE
> +	 * are scoped at the domain level. Writing any of these MSRs
> +	 * on one CPU is supposed to be observed by all CPUs in the
> +	 * domain. However, the hardware team recommends to update
> +	 * these MSRs on all the CPUs in the domain.
> +	 */
> +	on_each_cpu_mask(&d->cpu_mask, mon_event_config_write, &mon_info, 1);


> > +
> > +	/*
> > +	 * When an Event Configuration is changed, the bandwidth counters
> > +	 * for all RMIDs and Events will be cleared by the hardware. The
> > +	 * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for
> > +	 * every RMID on the next read to any event for every RMID.
> > +	 * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62)
> > +	 * cleared while it is tracked by the hardware. Clear the
> > +	 * mbm_local and mbm_total counts for all the RMIDs.
> > +	 */
> > +	resctrl_arch_reset_rmid_all(r, d);
> 
> If I understand correctly the expectation is that when user space read counters
> (via mon_data files) right after the configuration was changed then this read
> will return "Unavailable" and then the next read will return data.
> 
> If this is the case then I think a snippet about this user experience would be
> helpful to add to the documentation.
> 
> Have you considered doing a preemptive read on the RMIDs that are in use to
> avoid users encountering "Unavailable"? I assume doing so on a busy system
> could potentially involve hundreds of register reads/writes.
> 
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +static int mon_config_write(struct rdt_resource *r, char *tok, u32
> > +evtid) {
> > +	char *dom_str = NULL, *id_str;
> > +	unsigned long dom_id, val;
> > +	struct rdt_domain *d;
> > +	int ret = 0;
> > +
> > +next:
> > +	if (!tok || tok[0] == '\0')
> > +		return 0;
> > +
> > +	/* Start processing the strings for each domain */
> > +	dom_str = strim(strsep(&tok, ";"));
> > +	id_str = strsep(&dom_str, "=");
> > +
> > +	if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
> > +		rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!dom_str || kstrtoul(dom_str, 16, &val)) {
> > +		rdt_last_cmd_puts("Non-numeric event configuration
> value\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	list_for_each_entry(d, &r->domains, list) {
> > +		if (d->id == dom_id) {
> > +			ret = mbm_config_write_domain(r, d, evtid, val);
> > +			if (ret)
> > +				return -EINVAL;
> > +			goto next;
> > +		}
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> > +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
> > +					    char *buf, size_t nbytes,
> > +					    loff_t off)
> > +{
> > +	struct rdt_resource *r = of->kn->parent->priv;
> > +	int ret;
> > +
> > +	/* Valid input requires a trailing newline */
> > +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> > +		return -EINVAL;
> > +
> > +	cpus_read_lock();
> > +	mutex_lock(&rdtgroup_mutex);
> > +
> > +	rdt_last_cmd_clear();
> > +
> > +	buf[nbytes - 1] = '\0';
> > +
> > +	ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);
> > +
> > +	mutex_unlock(&rdtgroup_mutex);
> > +	cpus_read_unlock();
> > +
> > +	return ret ?: nbytes;
> > +}
> > +
> >  /* rdtgroup information files for one cache resource. */  static
> > struct rftype res_common_files[] = {
> >  	{
> > @@ -1617,9 +1741,10 @@ static struct rftype res_common_files[] = {
> >  	},
> >  	{
> >  		.name		= "mbm_total_bytes_config",
> > -		.mode		= 0444,
> > +		.mode		= 0644,
> >  		.kf_ops		= &rdtgroup_kf_single_ops,
> >  		.seq_show	= mbm_total_bytes_config_show,
> > +		.write		= mbm_total_bytes_config_write,
> >  	},
> >  	{
> >  		.name		= "mbm_local_bytes_config",
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index
> > 0cee154abc9f..e4dc65892446 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -250,6 +250,16 @@ int resctrl_arch_rmid_read(struct rdt_resource
> > *r, struct rdt_domain *d,  void resctrl_arch_reset_rmid(struct rdt_resource
> *r, struct rdt_domain *d,
> >  			     u32 rmid, enum resctrl_event_id eventid);
> >
> > +/**
> > + * resctrl_arch_reset_rmid_all() - Reset any private state associated with
> > + * 				   all the rmids.
> 
> It could be more explicit:
> "Reset all private state associated with all rmids and eventids."
> 
> > + * @r:		The domain's resource.
> > + * @d:		The rmid's domain.
> 
> This copy&paste needs some changes to match this new utility.
> How about:
> @r: The resctrl resource.
> @d: The domain for which all architectural counter state will be cleared.
> 
> I think it can be improved more but the above could be a start (please do not
> copy verbatim but ensure style is correct.)
> 
> Keep in mind that this utility does not clear the non-architectural counter state.
> This does not apply to AMD since that state is used by the software controller,
> but it needs to be kept in mind if another usage for this utility arises.
> 
> > + *
> > + * This can be called from any CPU.
> > + */
> > +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct
> > +rdt_domain *d);
> > +
> >  extern unsigned int resctrl_rmid_realloc_threshold;  extern unsigned
> > int resctrl_rmid_realloc_limit;
> >
> >
> >
> 
> The above hunk fails the "no spaces before tabs" checkpatch check.
> 
> Reinette




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux