RE: [PATCH v8 10/13] x86/resctrl: Add sysfs interface to write mbm_total_bytes_config

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

Hi James,

> -----Original Message-----
> From: James Morse <james.morse@xxxxxxx>
> Sent: Wednesday, December 7, 2022 11:21 AM
> To: Moger, Babu <Babu.Moger@xxxxxxx>
> Cc: fenghua.yu@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> hpa@xxxxxxxxx; paulmck@xxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> quic_neeraju@xxxxxxxxxxx; rdunlap@xxxxxxxxxxxxx;
> damien.lemoal@xxxxxxxxxxxxxxxxxx; songmuchun@xxxxxxxxxxxxx;
> peterz@xxxxxxxxxxxxx; jpoimboe@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> chang.seok.bae@xxxxxxxxx; pawan.kumar.gupta@xxxxxxxxxxxxxxx;
> jmattson@xxxxxxxxxx; daniel.sneddon@xxxxxxxxxxxxxxx; Das1, Sandipan
> <Sandipan.Das@xxxxxxx>; tony.luck@xxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; bagasdotme@xxxxxxxxx; eranian@xxxxxxxxxx;
> corbet@xxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; bp@xxxxxxxxx;
> reinette.chatre@xxxxxxxxx
> Subject: Re: [PATCH v8 10/13] x86/resctrl: Add sysfs interface to write
> mbm_total_bytes_config
> 
> Hi Babu,
> 
> (Nit: all the 'sysfs' in the subjects should really be 'resctrl', but as they already
> have 'x86/resctrl', could you just remove the sysfs?
> This patch would be "x86/resctrl: Add interface to write
> mbm_total_bytes_config")

Sure. Will change it.
> 
> On 04/11/2022 20:01, Babu Moger wrote:
> > The current event configuration for mbm_total_bytes can be changed by
> > the user by writing to the file
> > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.
> >
> > The event configuration settings are domain specific and will affect
> > all the CPUs in the domain.
> >
> > Following are the types of events supported:
> >
> > ====
> ===========================================================
> > Bits   Description
> > ====
> ===========================================================
> > 6      Dirty Victims from the QOS domain to all types of memory
> > 5      Reads to slow memory in the non-local NUMA domain
> > 4      Reads to slow memory in the local NUMA domain
> > 3      Non-temporal writes to non-local NUMA domain
> > 2      Non-temporal writes to local NUMA domain
> > 1      Reads to memory in the non-local NUMA domain
> > 0      Reads to memory in the local NUMA domain
> > ====
> ===========================================================
> >
> > For example:
> > To change the mbm_total_bytes to count only reads on domain 0, the
> > bits 0, 1, 4 and 5 needs to be set, which is 110011b (in hex 0x33).
> > Run the command.
> > 	$echo  0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> >
> > To change the mbm_total_bytes to count all the slow memory reads on
> > domain 1, the bits 4 and 5 needs to be set which is 110000b (in hex 0x30).
> > Run the command.
> > 	$echo  1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 18f9588a41cf..0cdccb69386e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -1505,6 +1505,133 @@ static int mbm_local_bytes_config_show(struct
> kernfs_open_file *of,
> >  	return 0;
> >  }
> >
> > +static void mon_event_config_write(void *info) {
> > +	struct mon_config_info *mon_info = info;
> > +	u32 index;
> > +
> > +	index = mon_event_config_index_get(mon_info->evtid);
> > +	if (index >= MAX_CONFIG_EVENTS) {
> > +		pr_warn_once("Invalid event id %d\n", mon_info->evtid);
> > +		return;
> > +	}
> > +	wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
> }
> > +
> > +static int mbm_config_write(struct rdt_resource *r, struct rdt_domain *d,
> > +			    u32 evtid, u32 val)
> > +{
> > +	struct mon_config_info mon_info = {0};
> > +	int ret = 0;
> > +
> > +	rdt_last_cmd_clear();
> > +
> > +	/* mon_config cannot be more than the supported set of events */
> > +	if (val > MAX_EVT_CONFIG_BITS) {
> > +		rdt_last_cmd_puts("Invalid event configuration\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * Read the current config value first. If both are same then
> > +	 * we don't need to write it again.
> > +	 */
> > +	mon_info.evtid = evtid;
> 
> > +	mondata_config_read(d, &mon_info);
> 
> This reads the MSR on this CPU, which gets the result for this domain...

[1] No. This read happens at the target domain. 

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
        smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
}

> 
> 
> > +	if (mon_info.mon_config == val)
> > +		goto write_exit;
> > +
> > +	mon_info.mon_config = val;
> > +
> > +	/*
> > +	 * Update MSR_IA32_EVT_CFG_BASE MSRs on all the CPUs in the
> > +	 * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE
> > +	 * are scoped at the domain level. Writing any of these MSRs
> > +	 * on one CPU is supposed to be observed by all CPUs in the
> > +	 * domain. However, the hardware team recommends to update
> > +	 * these MSRs on all the CPUs in the domain.
> > +	 */
> 
> > +	on_each_cpu_mask(&d->cpu_mask, mon_event_config_write,
> &mon_info,
> > +1);
> 
> ... but here you IPI all the CPUs in the target domain to update them.

[2] There have been some changes in this area recently. The requirement of writing the value on all the CPUs in the domain is not required anymore. I am working on verifying this right now.  If everything works, then I can do 
smp_call_function_any(&d->cpu_mask, mon_event_config_write,  &mon_info, 1);

I will confirm this soon.

> 
> This means you unnecessarily IPI the CPUs in the target domain if they already
> had this value, but the write syscall occurred on a domain that differs. This isn't
> what you intended, but its benign.
> More of a problem is: Won't this get skipped if the write syscall occurs on a
> domain that happens to have the target configuration already?

Do you still think this is a problem after my comment [1] above?  Or Am I missing something?

> 
> Because you need the same value to be written on every CPU ... what happens
> to CPUs that are offline when the configuration is changed? Do they keep their
> previous value, or does it get reset?

The contents of this MSR register are held outside of all the cores.  If the value changes while a cpu is offline, and it reads it once it comes online, it will get the new value.
> 
> 
> I think this is best solved with a percpu variable for the current value of the
> MSR. You can then read it for CPUs in a remote domain, and only issue IPIs to
> 'sync' the value if needed. You can then re-use the sync call in
> resctrl_online_cpu() to set the MSR to whatever value it should currently be.

This may not be required with my comment 1 and 2 above.

> 
> 
> > +
> > +	/*
> > +	 * When an Event Configuration is changed, the bandwidth counters
> > +	 * for all RMIDs and Events will be cleared by the hardware. The
> > +	 * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for
> > +	 * every RMID on the next read to any event for every RMID.
> > +	 * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62)
> > +	 * cleared while it is tracked by the hardware. Clear the
> > +	 * mbm_local and mbm_total counts for all the RMIDs.
> > +	 */
> > +	memset(d->mbm_local, 0, sizeof(struct mbm_state) * r->num_rmid);
> > +	memset(d->mbm_total, 0, sizeof(struct mbm_state) * r->num_rmid);
> > +
> > +write_exit:
> > +	return ret;
> > +}
> 
> 
> > +static int mon_config_parse(struct rdt_resource *r, char *tok, u32
> > +evtid) {
> > +	char *dom_str = NULL, *id_str;
> > +	unsigned long dom_id, val;
> > +	struct rdt_domain *d;
> > +	int ret = 0;
> > +
> > +next:
> > +	if (!tok || tok[0] == '\0')
> > +		return 0;
> > +
> > +	/* Start processing the strings for each domain */
> > +	dom_str = strim(strsep(&tok, ";"));
> > +	id_str = strsep(&dom_str, "=");
> > +
> > +	if (!dom_str || kstrtoul(id_str, 10, &dom_id)) {
> > +		rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!dom_str || kstrtoul(dom_str, 16, &val)) {
> > +		rdt_last_cmd_puts("Missing '=' or non-numeric event
> configuration value\n");
> > +		return -EINVAL;
> > +	}
> 
> This is parsing the same format strings as parse_line(). Is there any chance that
> code could be re-used instead of duplicated? This way anything that is added to
> the format (or bugs found!) only need supporting in once place.

I have checked on reusing the parse_line. The parse_line is specifically written for schemata.  We can't reuse parse_line without changing it completely.

Thanks
Babu
> 
> 
> 
> > +	list_for_each_entry(d, &r->domains, list) {
> > +		if (d->id == dom_id) {
> > +			ret = mbm_config_write(r, d, evtid, val);
> > +			if (ret)
> > +				return -EINVAL;
> > +			goto next;
> > +		}
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> 
> 
> Thanks,
> 
> James




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux