Hi Tony,

On 11/15/2023 1:54 PM, Tony Luck wrote:
> On Wed, Nov 15, 2023 at 08:09:13AM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/9/2023 1:27 PM, Luck, Tony wrote:
>>>>> Maybe an additional mount option "mba_MBps_total" so the user can pick
>>>>> total instead of local?
>>>>
>>>> Is this something for which a remount is required? Can it not perhaps be
>>>> changed at runtime?
>>>
>>> In theory, yes. But I've been playing with a patch that adds a writable info/
>>> file to allow runtime switch:
>>>
>>> # ls -l /sys/fs/resctrl/info/MB/mba_MBps_control
>>> -rw-r--r--. 1 root root 0 Nov 9 10:57 /sys/fs/resctrl/info/MB/mba_MBps_control
>>> # cat /sys/fs/resctrl/info/MB/mba_MBps_control
>>> total
>>>
>>> and found that it's a bit tricky to switch out the MBM event from the
>>> state machine driving the feedback loop. I think the problem is in the
>>> code that tries to stop the control loop from switching between two
>>> throttling levels every second:
>>>
>>>     if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
>>>             new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
>>>     } else if (cur_msr_val < MAX_MBA_BW &&
>>>                (user_bw > (cur_bw + delta_bw))) {
>>>             new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
>>>     } else {
>>>             return;
>>>     }
>>>
>>> The code drops down one percentage step if current bandwidth is above
>>> the desired target. But stepping back up checks to see if "cur_bw + delta_bw"
>>> is below the target.
>>>
>>> Where does "delta_bw" come from? Code uses the Boolean flag "pmbm_data->delta_comp"
>>> to request that the once-per-second polling compute the change in bandwidth on the
>>> next poll after adjusting throttling MSRs.
>>>
>>> All of these values are in the "struct mbm_state" which is a per-event-id structure.
>>>
>>> Picking an event at boot time works fine. Likely also fine at mount time. But
>>> switching at run-time seems to frequently end up with a very large value in
>>> "delta_bw" (as it compares current & previous for this event ...
>>> and it looks like things changed from zero). Net effect is that throttling is increased when
>>> processes go over their target for the resctrl group, but throttling is never decreased.
>>
>> This is not clear to me. Would the state not also start from zero at boot and mount
>> time? From what I understand the state is also reset to zero on monitor group creation.
>
> Yes. All of boot, mount, mkdir start a group in a well defined state
> with no throttling applied (schemata shows bandwidth limit as 2^32
> MBytes/sec). If the user sets some realistic limit, and the group
> MBM measurement exceeds that limit, then the MBA MSR for the group
> is dropped from 100% to 90% and the delta_comp flag is set to record
> the delta_bw on the next 1-second poll.
>
> The value of delta_bw is only used when looking to reduce throttling.
> To be in that state this group must have been in a state where
> throttling was increased ... which would result in delta_bw being
> set up.
>
> Now look at what happens when switching from local to total for the
> first time. delta_bw is zero in the structures recording total bandwidth
> information. But more importantly so is prev_bw. If the code above
> changes throttling value and requests an updated calculation of delta_bw,
> that will be done using a value of prev_bw==0. I.e. delta_bw will be
> set to the current bandwidth. That high value will likely block attempts
> to reduce throttling.

Thank you for the detailed explanation. I think there are ways to make
this transition smoother, for example by not computing delta_bw when there
is no history (no "prev_bw_bytes"). But that would just fix the existing
algorithm without addressing the other issues you raised with this algorithm.

>
> Maybe when switching MBM source events the prev_bw value should be
> copied from old source structures to new source structures as a rough
> guide to avoid crazy actions.
> But that could also be wrong when
> switching from total to local for a group that has poor NUMA
> localization and total bandwidth is far higher than local.
>
>>> The whole heuristic seems a bit fragile. It works well for test processes that have
>>> constant memory bandwidth. But I could see it failing in scenarios like this:
>>>
>>> 1) Process is over MB limit
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Process blocks on some event and uses no bandwidth in next one second
>>> 4) Next poll. Linux computes delta_bw as abs(cur_bw - m->prev_bw). cur_bw is zero,
>>>    so delta_bw is set to full value of bandwidth that process used when over budget
>>> 5) Process resumes running
>>> 6) Linux sees process using less than target, but cur_bw + delta_bw is above target,
>>>    so Linux doesn't adjust throttling
>>>
>>> I think the goal was to avoid relaxing throttling and letting a resctrl group go back over
>>> target bandwidth. But that doesn't work either for groups with highly variable bandwidth
>>> requirements.
>>>
>>> 1) Group is over budget
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Group forks additional processes. New bandwidth from those offsets the reduction due to throttling
>>> 4) Next poll. Linux sees bandwidth is unchanged. Sets delta_bw = 0.
>>> 5) Next poll. Group's aggregate bandwidth is fractionally below target. Because delta_bw=0, Linux
>>>    reduces throttling.
>>> 6) Group goes over target.
>>>
>>
>> I'll defer to you for the history about this algorithm. I am not familiar with how
>> broadly this feature is used but I have not heard about issues with it. It does
>> seem as though there is some opportunity for investigation here.
>
> I'm sure I could construct an artificial test case to force this scenario.
> But maybe:
> 1) It never happens in real life
> 2) It happens, but nobody noticed
> 3) People figured out the workaround (set schemata to a really big
>    MBytes/sec value for a second, and then back to desired value).
> 4) Few people use this option
>
> I dug again into the lore.kernel.org archives. Thomas complained
> that it wasn't "calibration" (as Vikas had described it in V1) but
> seems to have otherwise been OK with it as a heuristic.
>
> https://lore.kernel.org/all/alpine.DEB.2.21.1804041037090.2056@xxxxxxxxxxxxxxxxxxxxxxx/
>
>
> I coded up and tested the below patch as a possible replacement heuristic.
> But I also wonder whether just letting the feedback loop flip throttling
> up and down between throttling values above/below the target bandwidth
> would really be so bad. It's just one MSR write that can be done from
> the current CPU and would result in average bandwidth closer to the
> user requested target.

The proposed heuristic seems to assume that the bandwidth used has a linear
relationship to the throttling percentage. It seems to set aside the reasons
that motivated this "delta_bw" in the first place:

> - * This is because (1)the increase in bandwidth is not perfectly
> - * linear and only "approximately" linear even when the hardware
> - * says it is linear.(2)Also since MBA is a core specific
> - * mechanism, the delta values vary based on number of cores used
> - * by the rdtgrp.

From the above I understand that reducing throttling by 10% does not imply
that bandwidth consumed will increase by 10%. A new heuristic like this may
thus decide not to relax throttling, expecting that doing so would push
bandwidth over the limit, even though the non-linear increase could leave
the bandwidth consumed under the limit once throttling is relaxed.

I am also curious if only using target bandwidth would be bad.
I looked through the spec and was not able to find any information about
the cost of adjusting the allocation once per second (per resource group
per domain). The closest I could find was the discussion of the need for a
"fine-grained software controller", where it is not clear whether "once per
second" can be considered "fine grained".

Reinette
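
For reference, the transition problem described above (a freshly zeroed
per-event state with prev_bw == 0) and the "no history" guard can be modeled
in a few lines of user-space C. This is only a sketch under stated
assumptions, not the kernel code: struct mbm_state_model, poll_once(),
run_model(), the have_history field, the guard_history switch, and all
constants and bandwidth figures are hypothetical names and values chosen for
illustration. It also assumes that a delta_bw computation is already pending
when the new event's state is first used, and that measured bandwidth stays
constant at 800 MBytes/sec, which real throttling would not allow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants standing in for MAX_MBA_BW, min_bw and bw_gran. */
#define MODEL_MAX_BW  100u
#define MODEL_MIN_BW   10u
#define MODEL_BW_GRAN  10u

/* Simplified stand-in for the per-event state discussed in the thread. */
struct mbm_state_model {
	unsigned int prev_bw;      /* bandwidth seen on the previous poll */
	unsigned int delta_bw;     /* change recorded after the last adjustment */
	int delta_comp;            /* recompute delta_bw on the next poll */
	int have_history;          /* assumption: prev_bw holds a real sample */
};

/*
 * One poll: optionally refresh delta_bw, record the sample, then apply the
 * step-down / step-up rule quoted in the thread. With guard_history set,
 * delta_bw is not computed until a previous sample exists.
 */
unsigned int poll_once(struct mbm_state_model *m, unsigned int cur_bw,
		       unsigned int user_bw, unsigned int cur_msr_val,
		       int guard_history)
{
	unsigned int new_msr_val = cur_msr_val;

	assert(m != NULL);

	if (m->delta_comp) {
		if (!guard_history || m->have_history)
			m->delta_bw = cur_bw > m->prev_bw ?
				      cur_bw - m->prev_bw : m->prev_bw - cur_bw;
		m->delta_comp = 0;
	}
	m->prev_bw = cur_bw;
	m->have_history = 1;

	if (cur_msr_val > MODEL_MIN_BW && user_bw < cur_bw)
		new_msr_val = cur_msr_val - MODEL_BW_GRAN;
	else if (cur_msr_val < MODEL_MAX_BW && user_bw > cur_bw + m->delta_bw)
		new_msr_val = cur_msr_val + MODEL_BW_GRAN;

	if (new_msr_val != cur_msr_val)
		m->delta_comp = 1;	/* measure the effect on the next poll */

	return new_msr_val;
}

/* Five polls after a switch to a zeroed state: target 1000, measured 800. */
unsigned int run_model(int guard_history)
{
	struct mbm_state_model m = { .prev_bw = 0, .delta_bw = 0,
				     .delta_comp = 1, .have_history = 0 };
	unsigned int msr = 50;
	int i;

	for (i = 0; i < 5; i++)
		msr = poll_once(&m, 800, 1000, msr, guard_history);

	return msr;
}
```

With these assumed inputs, run_model(0) leaves the throttle value stuck at 50
because delta_bw is recorded as the full 800 MBytes/sec, while run_model(1)
steps back up to 100 over the five polls. As noted in the reply, the guard
only smooths the transition and does not address the other weaknesses of the
heuristic.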