Hi Peter,

On 5/15/2023 7:42 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Fri, May 12, 2023 at 5:26 PM Reinette Chatre
> <reinette.chatre@xxxxxxxxx> wrote:
>> On 5/12/2023 6:25 AM, Peter Newman wrote:
>>> On Thu, May 11, 2023 at 11:37 PM Reinette Chatre
>>> <reinette.chatre@xxxxxxxxx> wrote:
>>>> On 4/21/2023 7:17 AM, Peter Newman wrote:
>>>>> Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM
>>>>> event counts into its current software RMID. The delta for each CPU is
>>>>> determined by tracking the previous event counts in per-CPU data. The
>>>>> software byte counts reside in the arch-independent mbm_state
>>>>> structures.
>>>>
>>>> Could you elaborate why the arch-independent mbm_state was chosen?
>>>
>>> It largely had to do with how many soft RMIDs to implement. For our
>>> own needs, we were mainly concerned with getting back to the number of
>>> monitoring groups the hardware claimed to support, so there wasn't
>>> much internal motivation to support an unbounded number of soft RMIDs.
>>
>> Apologies for not being explicit, I was actually curious why the
>> arch-independent mbm_state, as opposed to the arch-dependent state, was
>> chosen.
>>
>> I think the lines are getting a bit blurry here with the software RMID
>> feature added as a resctrl filesystem feature (and thus non-architectural),
>> but it is specific to the AMD architecture.
>
> The soft RMID solution applies conceptually to any system where the
> number of hardware counters is smaller than the number of desired
> monitoring groups, but at least as large as the number of CPUs. It's a
> solution we may need to rely on more in the future, as it's easier for
> monitoring hardware to scale to the number of CPUs than to (CPUs *
> mbm_domains). I believed the counts in bytes would apply to the user
> interface universally.
>
> However, I did recently rebase these changes onto one of James's MPAM
> snapshot branches, and __mbm_flush() did end up fitting better on the
> arch-dependent side, so I was forced to move the counters over to
> arch_mbm_state, because on the snapshot branch the arch-dependent code
> cannot see the arch-independent mbm_state structure. I then created
> resctrl_arch_*() helpers for __mon_event_count() to read the counts
> from the arch_mbm_state.
>
> In hindsight, despite generic-looking code being able to retrieve the
> CPU counts with resctrl_arch_rmid_read(), the permanent assignment of
> a HW RMID to a CPU is an implementation detail specific to the
> RDT/PQoS interface and would probably not align with a theoretical MPAM
> implementation.

Indeed. There are a couple of points in this work that blur the clear
boundary that the MPAM enabling is trying to establish. We should keep
this in mind when looking at how this should be solved, so that it does
not create another item that needs to be untangled.

A small but significant start may be to start referring to this as
"soft mon id" or something else that is generic, especially since this
is proposed as a mount option and thus tied to the filesystem.

>>>>> +/*
>>>>> + * Called from context switch code __resctrl_sched_in() when the current soft
>>>>> + * RMID is changing or before reporting event counts to user space.
>>>>> + */
>>>>> +void resctrl_mbm_flush_cpu(void)
>>>>> +{
>>>>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>>>> +	int cpu = smp_processor_id();
>>>>> +	struct rdt_domain *d;
>>>>> +
>>>>> +	d = get_domain_from_cpu(cpu, r);
>>>>> +	if (!d)
>>>>> +		return;
>>>>> +
>>>>> +	if (is_mbm_local_enabled())
>>>>> +		__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
>>>>> +	if (is_mbm_total_enabled())
>>>>> +		__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
>>>>> +}
>>>>
>>>> This (potentially) adds two MSR writes and two MSR reads to what could
>>>> possibly be quite slow MSRs if it was not designed to be used in context
>>>> switch. Do you perhaps have data on how long these MSR reads/writes take
>>>> on these systems to get an idea about the impact on context switch? I
>>>> think this data should feature prominently in the changelog.
>>>
>>> I can probably use ftrace to determine the cost of an __rmid_read()
>>> call on a few implementations.
>>
>> On a lower level I think it may be interesting to measure more closely
>> just how long a wrmsr and rdmsr take on these registers. It may be
>> interesting if you, for example, use rdtsc_ordered() before and after
>> these calls, and then compare it to how long it takes to write the PQR
>> register that has been designed to be used in context switch.
>>
>>> To understand the overall impact to context switch, I can put together
>>> a scenario where I can control whether the context switches being
>>> measured result in a change of soft RMID, to prevent the data from being
>>> diluted by non-flushing switches.
>>
>> This sounds great. Thank you very much.
>
> I used a simple parent-child pipe loop benchmark with the parent in
> one monitoring group and the child in another to trigger 2M
> context switches on the same CPU, and compared the sample-based
> profiles on an AMD and an Intel implementation. I used perf diff to
> compare the samples between hard and soft RMID switches.
>
> Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
>
>              +44.80%  [kernel.kallsyms]  [k] __rmid_read
>     10.43%   -9.52%   [kernel.kallsyms]  [k] __switch_to
>
> AMD EPYC 7B12 64-Core Processor:
>
>              +28.27%  [kernel.kallsyms]  [k] __rmid_read
>     13.45%  -13.44%   [kernel.kallsyms]  [k] __switch_to
>
> Note that a soft RMID switch that doesn't change the CLOSID skips the
> PQR_ASSOC write completely, so from this data I can roughly say that
> __rmid_read() is a little over 2x the length of a PQR_ASSOC write that
> changes the current RMID on the AMD implementation, and about 4.5x
> longer on Intel.
>
> Let me know if this clarifies the cost enough or if you'd like to also
> see instrumented measurements on the individual WRMSR/RDMSR
> instructions.

I can see from the data the portion of total time spent in
__rmid_read(). It is not clear to me what the impact on a context
switch is. Is it possible to say with this data that this solution
makes a context switch x% slower?

I think it may be optimistic to view this as a replacement of a PQR
write. As you point out, that requires that a CPU switches between
tasks with the same CLOSID.

You demonstrate that resctrl already contributes a significant delay
to __switch_to, and this work will increase that much more. The impact
has to be made clear, and it has to be motivated as acceptable.

Reinette