Hi Reinette, On Fri, May 12, 2023 at 3:25 PM Peter Newman <peternewman@xxxxxxxxxx> wrote: > On Thu, May 11, 2023 at 11:37 PM Reinette Chatre > <reinette.chatre@xxxxxxxxx> wrote: > > On 4/21/2023 7:17 AM, Peter Newman wrote: > > > + > > > + if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) { > > > + counter = &state->local; > > > + } else { > > > + WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID); > > > + counter = &state->total; > > > + } > > > + > > > + /* > > > + * Propagate the value read from the hw_rmid assigned to the current CPU > > > + * into the "soft" rmid associated with the current task or CPU. > > > + */ > > > + m = get_mbm_state(d, soft_rmid, evtid); > > > + if (!m) > > > + return; > > > + > > > + if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val)) > > > + return; > > > + > > > > This all seems unsafe to run without protection. The code relies on > > the rdt_domain but a CPU hotplug event could result in the domain > > disappearing underneath this code. The accesses to the data structures > > also appear unsafe to me. Note that resctrl_arch_rmid_read() updates > > the architectural MBM state and this same state can be updated concurrently > > in other code paths without appropriate locking. > > The domain is supposed to always be the current one, but I see that > even a get_domain_from_cpu(smp_processor_id(), ...) call needs to walk > a resource's domain list to find a matching entry, which could be > concurrently modified when other domains are added/removed. > > Similarly, when soft RMIDs are enabled, it should not be possible to > call resctrl_arch_rmid_read() outside of on the current CPU's HW RMID. > > I'll need to confirm whether it's safe to access the current CPU's > rdt_domain in an atomic context. If it isn't, I assume I would have to > arrange all of the state used during flush to be per-CPU. > > I expect the constraints on what data can be safely accessed where is > going to constrain how the state is ultimately arranged, so I will > need to settle this before I can come back to the other questions > about mbm_state. According to cpu_hotplug.rst, the startup callbacks are called before a CPU is started and the teardown callbacks are called after the CPU has become dysfunctional, so it should always be safe for a CPU to access its own data, so all I need to do here is avoid walking domain lists in resctrl_mbm_flush_cpu(). However, this also means that resctrl_{on,off}line_cpu() call clear_closid_rmid() on a different CPU, so whichever CPU executes these will zap its own pqr_state struct and PQR_ASSOC MSR. -Peter