On 8/28/2023 1:59 PM, Tony Luck wrote:
> On Mon, Aug 28, 2023 at 12:56:27PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/28/2023 11:46 AM, Tony Luck wrote:
>>> On Mon, Aug 28, 2023 at 10:05:31AM -0700, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 8/25/2023 10:05 AM, Tony Luck wrote:
>>>>> On Fri, Aug 11, 2023 at 10:29:25AM -0700, Reinette Chatre wrote:
>>>>>> On 7/22/2023 12:07 PM, Tony Luck wrote:
>>>>
>>>>>>> Change all places where monitoring functions walk the list of
>>>>>>> domains to use the new "mondomains" list instead of the old
>>>>>>> "domains" list.
>>>>>>
>>>>>> I would not refer to it as "the old domains list" as it creates the
>>>>>> impression that this is being replaced. The changelog makes
>>>>>> no mention that the domains list will remain and be dedicated to
>>>>>> control domains. I think this is important to include in the
>>>>>> description of this change.
>>>>>
>>>>> I've rewritten the entire commit message incorporating your suggestions.
>>>>> V6 will be posted soon (after I get some time on an SNC SPR to check
>>>>> that it all works!)
>>>>
>>>> I seem to have missed v5.
>>>
>>> I simply can't count. You are correct that the next version to be posted
>>> will be v5.
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
>>>>>>> ---
>>>>>>>  include/linux/resctrl.h                   |  10 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/internal.h    |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/core.c        | 195 +++++++++++++++-------
>>>>>>>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/monitor.c     |   2 +-
>>>>>>>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  30 ++--
>>>>>>>  6 files changed, 167 insertions(+), 74 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>>>>> index 8334eeacfec5..1267d56f9e76 100644
>>>>>>> --- a/include/linux/resctrl.h
>>>>>>> +++ b/include/linux/resctrl.h
>>>>>>> @@ -151,9 +151,11 @@ struct resctrl_schema;
>>>>>>>   * @mon_capable:    Is monitor feature available on this machine
>>>>>>>   * @num_rmid:       Number of RMIDs available
>>>>>>>   * @cache_level:    Which cache level defines scope of this resource
>>>>>>> + * @mon_scope:      Scope of this resource if different from cache_level
>>>>>>
>>>>>> I think this addition should be deferred. As it is here, the "if different
>>>>>> from cache_level" also creates many questions (when will it be different?
>>>>>> how will it be determined that the scope is different in order to know that
>>>>>> mon_scope should be used?)
>>>>>
>>>>> I've gone in a different direction. V6 renames "cache_level" to
>>>>> "ctrl_scope". I think this makes intent clear from step #1.
>>>>>
>>>>
>>>> This change is not clear to me. Previously you changed this name
>>>> but kept using it in code specific to cache levels. It is not clear
>>>> to me how this time's rename would be different.
>>>
>>> The current "cache_level" field in the structure describes
>>> the scope of each instance using the cache level (2 or 3) as the
>>> method to describe which CPUs are considered part of a group. Currently
>>> the scope is the same for both control and monitor resources.
>>
>> Right.
>>
>>>
>>> Would you like to see patches in this progression:
>>>
>>> 1) Rename "cache_level" to "scope", with a commit comment that future
>>> patches are going to base the scope on NUMA nodes in addition to sharing
>>> caches at particular levels, and will split it into separate control and
>>> monitor scope.
>>>
>>> 2) Split the "scope" field from the first patch into "ctrl_scope" and
>>> "mon_scope" (also with the addition of the new list for the mon_scope).
>>>
>>> 3) Add "node" as a new option for scope in addition to L3 and L2 cache.
>>>
>>
>> hmmm - my comment cannot be addressed through patch re-ordering.
>> If I understand correctly you plan to change the name of "cache_level"
>> to "ctrl_scope". My comment is that this obfuscates the code as long as
>> you use this variable to compare against data that can only represent cache
>> levels. This just repeats what I wrote in
>> https://lore.kernel.org/lkml/09847c37-66d7-c286-a313-308eaa338c64@xxxxxxxxx/
>
> I'm proposing more than just re-ordering. The above sequence adds a
> couple of extra patches to the series.
>
> Existing state of code:
>
> There is a single field named "cache_level" that describes how
> CPUs are assigned to domains based on their sharing of a cache
> at a particular level. Hard-coded values of "2" and "3" are used
> to describe the level. This is just a scope description of which
> CPUs are grouped together. But it is limited to doing so only
> based on which caches are shared by those CPUs.
>
> Step 1:
>
> Change the name of the field s/cache_level/scope/. Provide an
> enum with values RESCTRL_L3_CACHE, RESCTRL_L2_CACHE and use
> those throughout the code instead of 3, 2, and implicitly passing
> the resctrl scope to functions like get_cpu_cacheinfo_id().
>
> Add a get_domain_id_from_scope() function that takes the enum
> values for scope and converts them to "3", "2" to pass to
> get_cpu_cacheinfo_id().
>
> No functional change. Just broadening the meaning of the field
> so that future patches can use it to describe scopes that
> aren't a function of sharing a cache at a particular level.

Right. So if I understand correctly, you are planning the same change as
you did in V3, like below, and my original comment still stands.

@@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
 	ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
 	for (i = 0; i < ci->num_leaves; i++) {
-		if (ci->info_list[i].level == r->cache_level) {
+		if (ci->info_list[i].level == r->scope) {
 			size = ci->info_list[i].size / r->cache.cbm_len * num_b;
 			break;
 		}
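For concreteness, the "Step 1" shape described above would look roughly like
the sketch below. The enum names and the get_domain_id_from_scope() helper
come from the description in this thread; the numeric values and the function
body are illustrative assumptions, not a posted patch.

#include <linux/cacheinfo.h>	/* get_cpu_cacheinfo_id() */
#include <linux/errno.h>

enum resctrl_scope {
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
};

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		/* Assumed: enum values match the cache level they name. */
		return get_cpu_cacheinfo_id(cpu, scope);
	default:
		break;
	}

	return -EINVAL;
}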
>>>>>>> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>>>>>>>   */
>>>>>>>  static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>>>>>>  {
>>>>>>> -	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>>>>>> -	struct list_head *add_pos = NULL;
>>>>>>> -	struct rdt_hw_domain *hw_dom;
>>>>>>> -	struct rdt_domain *d;
>>>>>>> -	int err;
>>>>>>> -
>>>>>>> -	d = rdt_find_domain(r, id, &add_pos);
>>>>>>> -	if (IS_ERR(d)) {
>>>>>>> -		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	if (d) {
>>>>>>> -		cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>>>> -		if (r->cache.arch_has_per_cpu_cfg)
>>>>>>> -			rdt_domain_reconfigure_cdp(r);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>>>>>> -	if (!hw_dom)
>>>>>>> -		return;
>>>>>>> -
>>>>>>> -	d = &hw_dom->d_resctrl;
>>>>>>> -	d->id = id;
>>>>>>> -	cpumask_set_cpu(cpu, &d->cpu_mask);
>>>>>>> -
>>>>>>> -	rdt_domain_reconfigure_cdp(r);
>>>>>>> -
>>>>>>> -	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -		return;
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	list_add_tail(&d->list, add_pos);
>>>>>>> -
>>>>>>> -	err = resctrl_online_domain(r, d);
>>>>>>> -	if (err) {
>>>>>>> -		list_del(&d->list);
>>>>>>> -		domain_free(hw_dom);
>>>>>>> -	}
>>>>>>> +	if (r->alloc_capable)
>>>>>>> +		domain_add_cpu_ctrl(cpu, r);
>>>>>>> +	if (r->mon_capable)
>>>>>>> +		domain_add_cpu_mon(cpu, r);
>>>>>>>  }
>>>>>>
>>>>>> A resource could be both alloc and mon capable ... both
>>>>>> domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
>>>>>> Should domain_add_cpu_mon() still be run for a CPU if
>>>>>> domain_add_cpu_ctrl() failed?
>>>>>>
>>>>>> Looking ahead, the CPU should probably also not be added
>>>>>> to the default group's mask if a failure occurred.
>>>>>
>>>>> Existing code doesn't do anything for the case where a CPU
>>>>> can't be added to a domain (probably the only real error case
>>>>> is failure to allocate memory for the domain structure).
>>>>
>>>> Is my statement about the CPU being added to the default group's mask
>>>> incorrect? Seems like a potential issue related to the domain's
>>>> CPU mask also.
>>>>
>>>> Please see my earlier question. Existing code does not proceed
>>>> with monitor initialization if control initialization fails and
>>>> undoes control initialization if monitor initialization fails.
>>>
>>> Existing code silently continues if a domain structure cannot
>>> be allocated to add a CPU to a domain:
>>>
>>> 503 static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>> 504 {
>>> 505 	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
>>> 506 	struct list_head *add_pos = NULL;
>>> 507 	struct rdt_hw_domain *hw_dom;
>>> 508 	struct rdt_domain *d;
>>> 509 	int err;
>>>
>>> ...
>>>
>>> 523
>>> 524 	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
>>> 525 	if (!hw_dom)
>>> 526 		return;
>>> 527
>>>
>>
>> Right ... and if it returns silently as above it runs:
>>
>> static int resctrl_online_cpu(unsigned int cpu)
>> {
>>
>>
>> 	for_each_capable_rdt_resource(r)
>> 		domain_add_cpu(cpu, r);
>> >>>>>	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);   <<<<<<<<
>>
>> }
>>
>> Also, note within domain_add_cpu():
>>
>> static void domain_add_cpu(int cpu, struct rdt_resource *r)
>> {
>>
>>
>> 	...
>> 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>> 		domain_free(hw_dom);
>> 		return;
>> 	}
>>
>> 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>> 		domain_free(hw_dom);
>> 		return;
>> 	}
>>
>> 	...
>> }
>>
>> The above is the other item that I've been trying to discuss
>> with you. Note that existing resctrl will not initialize monitoring if
>> control could not be initialized.
>> Compare with this submission:
>>
>> 	if (r->alloc_capable)
>> 		domain_add_cpu_ctrl(cpu, r);
>> 	if (r->mon_capable)
>> 		domain_add_cpu_mon(cpu, r);
>>
>> I'll stop trying

Reinette
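To make the error-handling concern above concrete, here is a purely
illustrative sketch. It assumes, unlike the submission under review (whose
helpers are void), that domain_add_cpu_ctrl() and domain_add_cpu_mon() return
an error code, so the caller can preserve the existing behaviour of skipping
monitor setup when control setup fails and report the failure upward so that
resctrl_online_cpu() could avoid adding the CPU to rdtgroup_default.cpu_mask.

/*
 * Illustrative sketch only: the split helpers are assumed to return an
 * error code here. With that assumption, domain_add_cpu() keeps the
 * existing semantics (no monitor setup when control setup fails) and
 * propagates the failure to its caller.
 */
static int domain_add_cpu(int cpu, struct rdt_resource *r)
{
	int err;

	if (r->alloc_capable) {
		err = domain_add_cpu_ctrl(cpu, r);
		if (err)
			return err;
	}

	if (r->mon_capable) {
		err = domain_add_cpu_mon(cpu, r);
		if (err)
			return err;
	}

	return 0;
}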