On Fri 02-06-23 15:57:59, Marcelo Tosatti wrote:
> The interruption caused by vmstat_update is undesirable
> for certain applications:
>
> oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
>         oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold.

I personally find the above problem description insufficient. I have
asked several times and only got piece by piece information each time.
Maybe there is a reason to be secretive, but it would be great to get at
least some basic expectations described, along with what they are based
on.

E.g. workloads are running on isolated CPUs with nohz full mode to
shield off any kernel interruption. Yet there are operations (like
mlock, but not mlock alone) that update per-CPU counters which will
eventually get flushed, and that flush will cause some interference.
Now the host/guest transition and interference: how does that happen
when the guest is running on an isolated and dedicated CPU?

> Skip periodic updates for nohz full CPUs. Any callers who
> need precise values should use a snapshot of the per-CPU
> counters, or use the global counters with measures to
> handle errors up to thresholds (see calculate_normal_threshold).

I would rephrase this paragraph.
In-kernel users of vmstat counters either require the precise value, and
they use the zone_page_state_snapshot interface, or they can live with
an imprecision, as the regular flushing can happen at an arbitrary time
and the cumulative error can grow (see calculate_normal_threshold).
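
To illustrate the two kinds of readers, here is a rough user-space
sketch. It is a simplified model and not the code in mm/vmstat.c: every
name in it (mod_state, read_global, read_snapshot, shepherd_flush, the
isolated[] array, the fixed THRESHOLD) is made up for the illustration,
and the real threshold is computed per zone rather than hard-coded.

```c
#include <stdbool.h>

#define NR_CPUS   4
#define THRESHOLD 32            /* hypothetical fixed fold threshold */

static long global_count;       /* what a plain global read sees */
static long cpu_diff[NR_CPUS];  /* per-CPU deltas not yet folded */
static bool isolated[NR_CPUS];  /* stand-in for cpu_is_isolated() */

/* Update on one CPU; fold into the global only once past the threshold. */
static void mod_state(int cpu, long delta)
{
	cpu_diff[cpu] += delta;
	if (cpu_diff[cpu] > THRESHOLD || cpu_diff[cpu] < -THRESHOLD) {
		global_count += cpu_diff[cpu];
		cpu_diff[cpu] = 0;
	}
}

/* Cheap read: can be off by up to THRESHOLD for each CPU. */
static long read_global(void)
{
	return global_count;
}

/* Precise read: global value plus every pending per-CPU delta. */
static long read_snapshot(void)
{
	long v = global_count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		v += cpu_diff[cpu];
	return v;
}

/* Periodic flush that leaves isolated CPUs alone. */
static void shepherd_flush(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (isolated[cpu])
			continue;
		global_count += cpu_diff[cpu];
		cpu_diff[cpu] = 0;
	}
}
```

In this model a below-threshold delta on an isolated CPU stays invisible
to read_global() even after shepherd_flush(), while read_snapshot()
still accounts for it. The error of the cheap read stays bounded by the
threshold, because mod_state() folds on its own once the delta grows
past it.
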
From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.

> Suggested by Michal Hocko.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

>
> ---
>
> v2: use cpu_is_isolated		(Michal Hocko)
>
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -28,6 +28,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> +#include <linux/sched/isolation.h>
>
>  #include "internal.h"
>
> @@ -2022,6 +2023,16 @@ static void vmstat_shepherd(struct work_
>  	for_each_online_cpu(cpu) {
>  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> +		/*
> +		 * Skip periodic updates for isolated CPUs.
> +		 * Any callers who need precise values should use
> +		 * a snapshot of the per-CPU counters, or use the global
> +		 * counters with measures to handle errors up to
> +		 * thresholds (see calculate_normal_threshold).
> +		 */
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +
>  		if (!delayed_work_pending(dw) && need_update(cpu))
>  			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>

-- 
Michal Hocko
SUSE Labs