Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events

Peter Newman <peternewman@xxxxxxxxxx> · Wed, 6 Dec 2023 10:38:15 -0800

Hi Reinette,

On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> On 12/5/2023 4:33 PM, Peter Newman wrote:
> > On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre
> > <reinette.chatre@xxxxxxxxx> wrote:
> >> On 12/1/2023 12:56 PM, Peter Newman wrote:

> > Ignoring any present-day resctrl interfaces, what we minimally need is...
> >
> > 1. global "start measurement", which enables a
> > read-counters-on-context switch flag, and broadcasts an IPI to all
> > CPUs to read their current count
> > 2. wait 5 seconds
> > 3. global "end measurement", to IPI all CPUs again for final counts
> > and clear the flag from step 1
> >
> > Then the user could read at their leisure all the (frozen) event
> > counts from memory until the next measurement begins.
> >
> > In our case, if we're measuring as often as 5 seconds for every
> > minute, that will already be a 12x aggregate reduction in overhead,
> > which would be worthwhile enough.
>
> The "con" here would be that during those 5 seconds (which I assume would be
> controlled via user space so potentially shorter or longer) all tasks in the
> system are expected to see a significant (but yet to be measured) impact
> on context switch delay.

Yes, of course. In the worst case I've measured (Zen2), it's roughly a
1700-cycle context switch penalty (~20%) for tasks in different
monitoring groups. That's bad, but the benefit we gain from the per-RMID
MBM data makes up for it several times over if we only pay the cost
during a measurement.
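
To make the accounting concrete, here is a toy userspace model of the
start/wait/end scheme quoted above. It is only a sketch, not kernel code:
the sim_* names and sizes are made up, the IPIs are simulated by plain
loops, and each CPU gets a free-running byte counter as a stand-in for
the real MBM event reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_NR_CPUS  4   /* hypothetical sizes, small for illustration */
#define SIM_NR_RMIDS 8

struct sim_cpu {
	uint32_t cur_rmid;   /* RMID of the task currently running */
	uint64_t hw_count;   /* free-running byte count (toy stand-in for MBM) */
	uint64_t last_read;  /* value of hw_count at the previous flush */
};

static struct sim_cpu sim_cpus[SIM_NR_CPUS];
static uint64_t sim_total[SIM_NR_RMIDS];  /* frozen outside a measurement */
static bool sim_measuring;                /* the flag from step 1 */

/* Credit bytes seen since the last flush to the RMID currently on this CPU. */
static void sim_flush_cpu(int cpu)
{
	struct sim_cpu *c = &sim_cpus[cpu];

	assert(c->cur_rmid < SIM_NR_RMIDS);
	sim_total[c->cur_rmid] += c->hw_count - c->last_read;
	c->last_read = c->hw_count;
}

/* Step 1: every CPU takes a baseline reading (the "IPI"), then set the flag. */
static void sim_start_measurement(void)
{
	for (int r = 0; r < SIM_NR_RMIDS; r++)
		sim_total[r] = 0;
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_cpus[cpu].last_read = sim_cpus[cpu].hw_count;
	sim_measuring = true;
}

/* Step 3: collect final counts from every CPU, then clear the flag. */
static void sim_end_measurement(void)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_flush_cpu(cpu);
	sim_measuring = false;
}

/* Context switch: the extra counter read is paid only while measuring. */
static void sim_context_switch(int cpu, uint32_t next_rmid)
{
	if (sim_measuring)
		sim_flush_cpu(cpu);
	sim_cpus[cpu].cur_rmid = next_rmid;
}

/* Memory traffic generated by whatever is running on this CPU. */
static void sim_traffic(int cpu, uint64_t bytes)
{
	sim_cpus[cpu].hw_count += bytes;
}
```

Outside the window sim_context_switch() does no counter read at all, and
sim_total[] stays frozen, so it can be read at leisure until the next
sim_start_measurement().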

> I expect the overflow handler should only be run during the measurement
> timeframe, to not defeat the "at their leisure" reading of counters.

Yes, correct. We wouldn't be interested in overflows of the hardware
counter when not actively measuring bandwidth.

>
> >>> The second involves avoiding the situation where a hardware counter
> >>> could be deallocated: Determine the number of simultaneous RMIDs
> >>> supported, reduce the effective number of RMIDs available to that
> >>> number. Use the default RMID (0) for all "unassigned" monitoring
> >>
> >> hmmm ... so on the one side there is "only the RMID within the PQR
> >> register can be guaranteed to be tracked by hardware" and on the
> >> other side there is "A given implementation may have insufficient
> >> hardware to simultaneously track the bandwidth for all RMID values
> >> that the hardware supports."
> >>
> >> From the above there seems to be something in the middle where
> >> some subset of the RMID values supported by hardware can be used
> >> to simultaneously track bandwidth? How can it be determined
> >> what this number of RMID values is?
> >
> > In the context of AMD, we could use the smallest number of CPUs in any
> > L3 domain as a lower bound of the number of counters.
>
> Could you please elaborate on this? (With the numbers of CPUs nowadays this
> may be many RMIDs, perhaps even more than what ABMC supports.)

I think the "In the context of AMD" part is key. This feature would only
be applicable to the AMD implementations we have today which do not
implement ABMC.  I believe the difficulties are unique to the topologies
of these systems: many small L3 domains per node with a relatively small
number of CPUs in each. If the L3 domains were large and few, simply
restricting the number of RMIDs and allocating on group creation as we
do today would probably be fine.

> I am missing something here since it is not obvious to me how this lower
> bound is determined. Let's assume that there are as many monitor groups
> (and thus as many assigned RMIDs) as there are CPUs in a L3 domain.
> Each monitor group may have many tasks. It can be expected that at any
> moment in time only a subset of assigned RMIDs are assigned to CPUs
> via the CPUs' PQR registers. Of those RMIDs that are not assigned to
> CPUs, how can it be certain that they continue to be tracked by hardware?

Are you asking whether the counters will ever be reclaimed proactively?
The behavior I've observed is that writing a new RMID into a PQR_ASSOC
register when all hardware counters in the domain are allocated will
trigger the reallocation.

However, I admit the wording in the PQoS spec[1] only goes far enough to
support the permanent-assignment workaround in the current patch series:

"All RMIDs which are currently in use by one or more processors in the
QOS domain will be tracked. The hardware will always begin tracking a
new RMID value when it gets written to the PQR_ASSOC register of any of
the processors in the QOS domain and it is not already being tracked.
When the hardware begins tracking an RMID that it was not previously
tracking, it will clear the QM_CTR for all events in the new RMID."

I would need to confirm whether a PQR_ASSOC write really is the only
trigger for reallocation, and request that the documentation be clarified
if it is.
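
For illustration, this is the behavior I have in mind, written as a small
userspace model. It is only a sketch under assumptions: the dom_* names
are made up, the victim-selection policy (first free counter, else the
first counter whose RMID is in no CPU's PQR_ASSOC) is a guess, and it
assumes counters are only ever reallocated on a PQR_ASSOC write, which is
exactly the part that is unconfirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DOM_NR_CPUS     4
#define DOM_NR_COUNTERS 4            /* assumed: at least one counter per CPU */
#define DOM_RMID_NONE   UINT32_MAX

struct dom {
	uint32_t pqr_rmid[DOM_NR_CPUS];     /* RMID in each CPU's PQR_ASSOC */
	uint32_t tracked[DOM_NR_COUNTERS];  /* RMID each counter is tracking */
	uint64_t qm_ctr[DOM_NR_COUNTERS];   /* accumulated bytes per counter */
};

static int dom_find_counter(const struct dom *d, uint32_t rmid)
{
	for (int i = 0; i < DOM_NR_COUNTERS; i++)
		if (d->tracked[i] == rmid)
			return i;
	return -1;
}

static bool dom_rmid_in_use(const struct dom *d, uint32_t rmid)
{
	for (int cpu = 0; cpu < DOM_NR_CPUS; cpu++)
		if (d->pqr_rmid[cpu] == rmid)
			return true;
	return false;
}

/* All CPUs start on the default RMID 0, which occupies one counter. */
static void dom_init(struct dom *d)
{
	for (int cpu = 0; cpu < DOM_NR_CPUS; cpu++)
		d->pqr_rmid[cpu] = 0;
	for (int i = 0; i < DOM_NR_COUNTERS; i++) {
		d->tracked[i] = DOM_RMID_NONE;
		d->qm_ctr[i] = 0;
	}
	d->tracked[0] = 0;
}

/* Guessed policy: prefer a free counter, else one whose RMID is idle. */
static int dom_pick_victim(const struct dom *d)
{
	for (int i = 0; i < DOM_NR_COUNTERS; i++)
		if (d->tracked[i] == DOM_RMID_NONE)
			return i;
	for (int i = 0; i < DOM_NR_COUNTERS; i++)
		if (!dom_rmid_in_use(d, d->tracked[i]))
			return i;
	return -1;
}

/* Returns true if the write made the hardware start tracking rmid afresh. */
static bool dom_write_pqr(struct dom *d, int cpu, uint32_t rmid)
{
	int victim;

	d->pqr_rmid[cpu] = rmid;
	if (dom_find_counter(d, rmid) >= 0)
		return false;            /* already tracked, count preserved */

	victim = dom_pick_victim(d);
	assert(victim >= 0);             /* holds while counters >= CPUs */
	d->tracked[victim] = rmid;
	d->qm_ctr[victim] = 0;           /* QM_CTR is cleared for the new RMID */
	return true;
}

/* Traffic from a CPU is charged to the counter tracking its current RMID. */
static void dom_traffic(struct dom *d, int cpu, uint64_t bytes)
{
	int i = dom_find_counter(d, d->pqr_rmid[cpu]);

	assert(i >= 0);
	d->qm_ctr[i] += bytes;
}

/* Returns false if rmid has no counter, i.e. its count has been lost. */
static bool dom_read(const struct dom *d, uint32_t rmid, uint64_t *val)
{
	int i = dom_find_counter(d, rmid);

	if (i < 0)
		return false;
	*val = d->qm_ctr[i];
	return true;
}
```

In this model an RMID that is not in any PQR_ASSOC keeps its counter until
some CPU writes an untracked RMID while no counter is free. With no more
RMIDs in circulation than counters, that write never happens, which is
the reasoning behind using the CPU count as a lower bound.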

> >>>
> >>> While the second feature is a lot more disruptive at the filesystem
> >>> layer, it does eliminate the added context switch overhead. Also, it
> >>
> >> Which changes to filesystem layer are you anticipating?
> >
> > Roughly speaking...
> >
> > 1. The proposed "assign" interface would have to become more indirect
> > to avoid understanding how assign could be implemented on various
> > platforms.
>
> It is almost starting to sound like we could learn from the tracing
> interface where individual events can be enabled/disabled ... with several
> events potentially enabled with an "enable" done higher in hierarchy, perhaps
> even globally to support the first approach ...

Sorry, can you clarify the part about the tracing interface? Tracing to
support dynamic autoconfiguration of events?

Thanks!
-Peter

 [1] AMD64 Technology Platform Quality of Service Extensions, Revision: 1.03:
     https://bugzilla.kernel.org/attachment.cgi?id=301365