On Mon 05-06-23 15:56:30, Marcelo Tosatti wrote:
> The schedule_work_on API uses the workqueue mechanism to
> queue a work item on a queue. A kernel thread, which
> runs on the target CPU, executes those work items.
>
> Therefore, when using the schedule_work_on API,
> the kworker kernel thread must be scheduled in
> for the work function to be executed.
>
> Time sensitive applications such as SoftPLCs
> (https://tum-esi.github.io/publications-list/PDF/2022-ETFA-How_Real_Time_Are_Virtual_PLCs.pdf)
> have their response times affected by such interruptions.
>
> The /proc/sys/vm/stat_refresh file was originally introduced
> with the following goal:
>
> "Provide /proc/sys/vm/stat_refresh to force an immediate update of
> per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
> before checking counts when testing. Originally added to work around a
> bug which left counts stranded indefinitely on a cpu going idle (an
> inaccuracy magnified when small below-batch numbers represent "huge"
> amounts of memory), but I believe that bug is now fixed: nonetheless,
> this is still a useful knob."
>
> Other than the potential interruption to a time sensitive application,
> if using SCHED_FIFO or SCHED_RR priority on the isolated CPU, then
> system hangs can occur:

The same thing can happen without isolated CPUs and this patch doesn't
help at all.

> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=978688

And this is an example of that...

> To avoid the problems above, do not schedule the work to synchronize
> per-CPU mm counters on isolated CPUs. Given the possibility of
> breaking existing userspace applications, avoid returning
> errors from access to /proc/sys/vm/stat_refresh.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>

It would be really helpful not to post new versions while the discussion
of the previous one is still ongoing.

Anyway
Nacked-by: Michal Hocko <mhocko@xxxxxxxx>

This is silently changing semantics and I do not think you have actually
shown this is a real life problem. To me it sounds like a theoretical
issue at most, and it can be worked around by disallowing use of this
interface from userspace. stat_refresh is mostly for debugging purposes
and I strongly doubt it is ever used in the environments you refer to in
this series.

> ---
> v3: improve changelog (Michal Hocko)
> v2: opencode schedule_on_each_cpu (Michal Hocko)
>
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -1881,8 +1881,13 @@ int vmstat_refresh(struct ctl_table *tab
>  		void *buffer, size_t *lenp, loff_t *ppos)
>  {
>  	long val;
> -	int err;
>  	int i;
> +	int cpu;
> +	struct work_struct __percpu *works;
> +
> +	works = alloc_percpu(struct work_struct);
> +	if (!works)
> +		return -ENOMEM;
>
>  	/*
>  	 * The regular update, every sysctl_stat_interval, may come later
> @@ -1896,9 +1901,24 @@ int vmstat_refresh(struct ctl_table *tab
>  	 * transiently negative values, report an error here if any of
>  	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
> -	err = schedule_on_each_cpu(refresh_vm_stats);
> -	if (err)
> -		return err;
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work;
> +
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +		work = per_cpu_ptr(works, cpu);
> +		INIT_WORK(work, refresh_vm_stats);
> +		schedule_work_on(cpu, work);
> +	}
> +
> +	for_each_online_cpu(cpu) {
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +		flush_work(per_cpu_ptr(works, cpu));
> +	}
> +	cpus_read_unlock();
> +	free_percpu(works);
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		/*
>  		 * Skip checking stats known to go negative occasionally.

--
Michal Hocko
SUSE Labs
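For context, a minimal userspace sketch of the debugging workflow the quoted
changelog describes: the procfs paths are the real ones, but the program
itself is only illustrative and is not part of the patch under review.

/*
 * Illustrative only: the stat_refresh debugging pattern from the quoted
 * changelog.  Writing any value to /proc/sys/vm/stat_refresh folds the
 * per-CPU vmstat deltas into the global counters before /proc/vmstat is
 * read, avoiding a sleep(2) while waiting for the periodic update.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	/* Any write triggers the refresh; the written value is ignored. */
	fd = open("/proc/sys/vm/stat_refresh", O_WRONLY);
	if (fd < 0 || write(fd, "1\n", 2) < 0) {
		perror("stat_refresh");
		return EXIT_FAILURE;
	}
	close(fd);

	/* The counters read back now include the just-flushed deltas. */
	fd = open("/proc/vmstat", O_RDONLY);
	if (fd < 0) {
		perror("/proc/vmstat");
		return EXIT_FAILURE;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);

	return EXIT_SUCCESS;
}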