Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events

Reinette Chatre <reinette.chatre@xxxxxxxxx> · Wed, 6 Dec 2023 12:02:40 -0800

Hi Peter,

On 12/6/2023 10:38 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre
> <reinette.chatre@xxxxxxxxx> wrote:
>>
>> On 12/5/2023 4:33 PM, Peter Newman wrote:
>>> On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre
>>> <reinette.chatre@xxxxxxxxx> wrote:
>>>> On 12/1/2023 12:56 PM, Peter Newman wrote:
> 
>>> Ignoring any present-day resctrl interfaces, what we minimally need is...
>>>
>>> 1. global "start measurement", which enables a
>>> read-counters-on-context switch flag, and broadcasts an IPI to all
>>> CPUs to read their current count
>>> 2. wait 5 seconds
>>> 3. global "end measurement", to IPI all CPUs again for final counts
>>> and clear the flag from step 1
>>>
>>> Then the user could read at their leisure all the (frozen) event
>>> counts from memory until the next measurement begins.
>>>
>>> In our case, if we're measuring as often as 5 seconds for every
>>> minute, that will already be a 12x aggregate reduction in overhead,
>>> which would be worthwhile enough.
>>
>> The "con" here would be that during those 5 seconds (which I assume would be
>> controlled via user space so potentially shorter or longer) all tasks in the
>> system are expected to see a significant (but yet to be measured) impact
>> on context switch delay.
> 
> Yes, of course. In the worst case I've measured, Zen2, it's roughly a
> 1700-cycle context switch penalty (~20%) for tasks in different
> monitoring groups. Bad, but the benefit we gain from the per-RMID MBM
> data makes up for it several times over if we only pay the cost during a
> measurement.

I see.
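
For reference, the amortized cost implied by those numbers can be sketched
as below. This is only back-of-the-envelope arithmetic: the helper name is
made up, and it assumes context switches are spread evenly over time so
that the penalty scales with the fraction of time spent measuring.

```python
def amortized_penalty_cycles(penalty_cycles, window_s, period_s):
    """Average extra cycles per context switch when the penalty is only
    paid during a window of window_s seconds out of every period_s seconds.

    Assumption (not from the thread): switches are uniformly distributed
    in time, so the average scales linearly with the duty cycle.
    """
    if not 0 <= window_s <= period_s:
        raise ValueError("window must fit inside the period")
    return penalty_cycles * (window_s / period_s)


# Figures quoted above: ~1700 cycles per switch (Zen2 worst case),
# measuring for 5 seconds out of every minute.
reduction = 60 / 5  # the "12x aggregate reduction"
average = amortized_penalty_cycles(1700, window_s=5, period_s=60)
print(f"{reduction:.0f}x reduction, ~{average:.0f} cycles per switch on average")
```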

> 
>> I expect the overflow handler should only be run during the measurement
>> timeframe, to not defeat the "at their leisure" reading of counters.
> 
> Yes, correct. We wouldn't be interested in overflows of the hardware
> counter when not actively measuring bandwidth.
> 
> 
>>
>>>>> The second involves avoiding the situation where a hardware counter
>>>>> could be deallocated: Determine the number of simultaneous RMIDs
>>>>> supported, reduce the effective number of RMIDs available to that
>>>>> number. Use the default RMID (0) for all "unassigned" monitoring
>>>>
>>>> hmmm ... so on the one side there is "only the RMID within the PQR
>>>> register can be guaranteed to be tracked by hardware" and on the
>>>> other side there is "A given implementation may have insufficient
>>>> hardware to simultaneously track the bandwidth for all RMID values
>>>> that the hardware supports."
>>>>
>>>> From the above there seems to be something in the middle where
>>>> some subset of the RMID values supported by hardware can be used
>>>> to simultaneously track bandwidth? How can it be determined
>>>> what this number of RMID values is?
>>>
>>> In the context of AMD, we could use the smallest number of CPUs in any
>>> L3 domain as a lower bound of the number of counters.
>>
>> Could you please elaborate on this? (With the numbers of CPUs nowadays this
>> may be many RMIDs, perhaps even more than what ABMC supports.)
> 
> I think the "In the context of AMD" part is key. This feature would only
> be applicable to the AMD implementations we have today which do not
> implement ABMC.  I believe the difficulties are unique to the topologies
> of these systems: many small L3 domains per node with a relatively small
> number of CPUs in each. If the L3 domains were large and few, simply
> restricting the number of RMIDs and allocating on group creation as we
> do today would probably be fine.
> 
>> I am missing something here since it is not obvious to me how this lower
>> bound is determined. Let's assume that there are as many monitor groups
>> (and thus as many assigned RMIDs) as there are CPUs in a L3 domain.
>> Each monitor group may have many tasks. It can be expected that at any
>> moment in time only a subset of assigned RMIDs are assigned to CPUs
>> via the CPUs' PQR registers. Of those RMIDs that are not assigned to
>> CPUs, how can it be certain that they continue to be tracked by hardware?
> 
> Are you asking whether the counters will ever be reclaimed proactively?
> The behavior I've observed is that writing a new RMID into a PQR_ASSOC
> register when all hardware counters in the domain are allocated will
> trigger the reallocation.

"When all hardware counters in the domain are allocated" sounds like the
ideal scenario, with the kernel knowing how many counters there are and
each counter associated with a unique RMID. As long as the kernel does not
attempt to monitor another RMID, this would accurately monitor the
monitor groups with an "assigned" RMID.

Adding support for hardware without a specification and guaranteed
behavior can potentially run into unexpected scenarios.

For example, there is no guarantee on how the counters are assigned.
The OS and hardware may thus have different views of which hardware
counter is "free". The OS may write a new RMID to PQR_ASSOC believing that
there is a counter available, while the hardware has its own allocation
mechanism and may reallocate a counter that is in use by an RMID that
the OS believes to be "assigned". I do not think anything prevents
hardware from doing this.

> However, I admit the wording in the PQoS spec[1] is only written to
> support the permanent-assignment workaround in the current patch series:
> 
> "All RMIDs which are currently in use by one or more processors in the
> QOS domain will be tracked. The hardware will always begin tracking a
> new RMID value when it gets written to the PQR_ASSOC register of any of
> the processors in the QOS domain and it is not already being tracked.
> When the hardware begins tracking an RMID that it was not previously
> tracking, it will clear the QM_CTR for all events in the new RMID."
> 
> I would need to confirm whether this is the case and request the
> documentation be clarified if it is.

Indeed. Once an RMID is "assigned", the expectation is that a counter
will be dedicated to it, but a PQR_ASSOC register may not see that
RMID for potentially long intervals. With the above guarantees, hardware
will be within its rights to reallocate that RMID's counter even if
there are other counters that are "free" from the OS's perspective.
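
To make the concern concrete, here is a toy model of one QOS domain. It is
a sketch only, not a description of any real implementation. The two
behaviors taken from the quoted spec text are that an RMID currently loaded
in some CPU's PQR_ASSOC stays tracked, and that a newly tracked RMID starts
from a cleared count. Everything else is an assumption: the class and
method names are hypothetical, the victim choice is modeled as random
because nothing specifies it, and the model requires at least as many
counters as CPUs, following the lower bound suggested earlier.

```python
import random


class DomainCounters:
    """Toy model of the hardware counters in one QOS domain (hypothetical)."""

    def __init__(self, num_counters, num_cpus, pick_victim=None, seed=0):
        if num_counters < num_cpus:
            raise ValueError("model assumes at least one counter per CPU")
        self.num_counters = num_counters
        self.pqr_assoc = [0] * num_cpus  # RMID currently loaded on each CPU
        self.counts = {0: 0}             # tracked RMID -> bytes counted so far
        rng = random.Random(seed)
        # ASSUMPTION: the replacement policy is unspecified, so default to random.
        self.pick_victim = pick_victim or rng.choice

    def write_pqr_assoc(self, cpu, rmid):
        """Load rmid on cpu; return the RMID that lost its counter, if any."""
        self.pqr_assoc[cpu] = rmid
        if rmid in self.counts:
            return None
        victim = None
        if len(self.counts) == self.num_counters:
            in_use = set(self.pqr_assoc)
            # Only RMIDs not loaded on any CPU are eligible for eviction.
            idle = sorted(r for r in self.counts if r not in in_use)
            victim = self.pick_victim(idle)
            del self.counts[victim]
        self.counts[rmid] = 0  # a newly tracked RMID starts from a cleared count
        return victim

    def account(self, cpu, nbytes):
        """Charge nbytes of traffic to whatever RMID is loaded on cpu."""
        self.counts[self.pqr_assoc[cpu]] += nbytes


hw = DomainCounters(num_counters=4, num_cpus=2)
hw.write_pqr_assoc(0, 5)   # a group the OS later deletes; hardware still tracks it
hw.write_pqr_assoc(0, 1)   # "assigned" group A
hw.account(0, 4096)
hw.write_pqr_assoc(1, 2)   # "assigned" group B; all four counters are now busy
hw.write_pqr_assoc(0, 0)   # group A's tasks go to sleep
victim = hw.write_pqr_assoc(0, 3)  # OS believes RMID 5's old counter is free
print("hardware dropped RMID", victim)
```

In this model the victim can be RMID 0, 1 or 5. The OS expects RMID 5's
counter to be reused, but nothing stops the hardware from picking RMID 1
instead, in which case the 4096 bytes already counted for group A are lost
the next time RMID 1 is loaded.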

>>>>> While the second feature is a lot more disruptive at the filesystem
>>>>> layer, it does eliminate the added context switch overhead. Also, it
>>>>
>>>> Which changes to filesystem layer are you anticipating?
>>>
>>> Roughly speaking...
>>>
>>> 1. The proposed "assign" interface would have to become more indirect
>>> to avoid understanding how assign could be implemented on various
>>> platforms.
>>
>> It is almost starting to sound like we could learn from the tracing
>> interface where individual events can be enabled/disabled ... with several
>> events potentially enabled with an "enable" done higher in hierarchy, perhaps
>> even globally to support the first approach ...
> 
> Sorry, can you clarify the part about the tracing interface? Tracing to
> support dynamic autoconfiguration of events?

I do not believe we are attempting to do anything revolutionary here so
I would like to consider other interfaces that user space may be
familiar and comfortable with. The first that came to mind was the
tracefs interface and how user space interacts with it to enable
trace events. tracefs provides an "enable" file at different levels of
the event hierarchy that user space can use to enable tracing of all
events below that level. There is also the global "tracing_on" file that
user space can use to dynamically start/stop tracing without needing to
frequently enable/disable events of interest.
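
As a reminder of what that interaction looks like, below is a minimal
sketch of driving those files from user space. The helper names are made
up for illustration. The mount point /sys/kernel/tracing is the usual
location but may differ on a given system, and writing to the real files
requires sufficient privileges.

```python
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # usual mount point; an assumption here


def enable_path(root, system=None, event=None):
    """Return the "enable" file for one event, one subsystem, or all events."""
    if event is not None and system is None:
        raise ValueError("an event must be named together with its subsystem")
    path = Path(root) / "events"
    if system is not None:
        path = path / system
    if event is not None:
        path = path / event
    return path / "enable"


def set_enabled(root, on, system=None, event=None):
    """Enable or disable events at the chosen level of the hierarchy."""
    enable_path(root, system, event).write_text("1" if on else "0")


def set_tracing_on(root, on):
    """Start or stop tracing globally without touching per-event state."""
    (Path(root) / "tracing_on").write_text("1" if on else "0")


# Example use (needs root and a mounted tracefs, so it is left commented out):
#   set_enabled(TRACEFS, True, "sched", "sched_switch")  # a single event
#   set_enabled(TRACEFS, True, "sched")                  # a whole subsystem
#   set_enabled(TRACEFS, True)                           # every event
#   set_tracing_on(TRACEFS, False)                       # pause, keep the setup
```

The parallel is that per-event state is configured once and a single
global switch then starts and stops collection, which is close to the
"start measurement" / "end measurement" flow described earlier.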

I do see some parallels with the discussions we have been having. I am not
proposing that we adapt the tracefs interface, but instead that we can
perhaps learn from it.

Reinette