On Thu 15-01-15 22:54:20, Vinayak Menon wrote:
> On 01/14/2015 10:20 PM, Michal Hocko wrote:
> > On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
> > [...]
> > > In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> > > had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> > > returned 92, and GFP_IOFS was set, and this resulted in
> > > too_many_isolated returning true. But one CPU's pageset
> > > vm_stat_diff had NR_ISOLATED_FILE as "-14", so the actual
> > > isolated count was zero. As there weren't any more updates to
> > > NR_ISOLATED_FILE and the vmstat_update deferred work had not
> > > been scheduled yet, 7 tasks were spinning in the congestion
> > > wait loop for around 4 seconds in the direct reclaim path.
> >
> > Not syncing for such a long time doesn't sound right. I am not
> > familiar with the vmstat syncing but sysctl_stat_interval is HZ so
> > it should happen much more often than every 4 seconds.
> >
>
> Though the interval is HZ, since vmstat_work is declared as a
> deferrable work, IIUC the timer trigger can be deferred to the next
> non-deferrable timer expiry on a CPU which is idle. This results in
> the vmstat syncing on an idle CPU being delayed by seconds. Maybe in
> most cases this behavior is fine, except in cases like this.

I am not sure I understand the above, because the CPU being idle doesn't
seem that important AFAICS. Anyway, I have checked the current code,
which changed quite recently with 7cc36bbddde5 ("vmstat: on-demand
vmstat workers V8"). Let's CC Christoph (the thread starts here:
http://thread.gmane.org/gmane.linux.kernel.mm/127229).

__round_jiffies_relative can easily turn a 1*HZ timeout into 2*HZ. Now
we have vmstat_shepherd, which waits to be queued and then waits to run.
When it finally runs it only queues the per-cpu vmstat_work, whose
timeout can also end up being 2*HZ for some CPUs. So we can indeed spend
4 seconds just on queuing.

That is not even mentioning work item latencies, especially when workers
are overloaded, e.g. by fs work items, and additional workers cannot be
created, e.g. due to memory pressure, so work items are processed only
by the workqueue rescuer. Then latencies grow a lot. We have seen an
issue where the rescuer had to process thousands of work items because
all workers were blocked on memory allocation - see
http://thread.gmane.org/gmane.linux.kernel/1816452 - which is already in
mainline as 008847f66c38 ("workqueue: allow rescuer thread to do more
work."). That patch reduces the latency considerably but it doesn't
remove it completely.

If the time between two syncs is large then the per-cpu drift might be
really large as well. Isn't this going to hurt other places where we
rely on the stats as well?

In this particular case the reclaimers are throttled because they see
too many isolated pages, which is just a result of the per-cpu drift and
a small LRU list. The system seems to be under serious memory pressure
already because there is basically no file cache. If the reclaimers
weren't throttled they would probably fail the reclaim and trigger the
OOM killer, which might help to resolve the situation. Instead, the
reclaimers are throttled.

Why cannot we simply update the global counters from vmstat_shepherd
directly? Sure, we would access remote CPU counters, but that wouldn't
happen often.

That still wouldn't be enough, because there is the workqueue latency.
So why cannot we do the whole shepherd thing from a kernel thread
instead? If the NUMA zone->pageset->pcp thing has to be done locally
then it could be a per-node kthread rather than being part of the
unrelated vmstat code.
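Roughly something like the below is what I have in mind - a completely
untested sketch just to illustrate the direction; fold_node_vmstat() is
a made-up placeholder for whatever would do the actual folding for that
node's CPUs:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/vmstat.h>

/*
 * Illustration only: one kernel thread per node folding the per-cpu
 * vmstat diffs on a fixed interval, independent of deferrable timers
 * and the workqueue machinery.
 */
static int vmstat_kthread(void *data)
{
	int nid = (long)data;

	while (!kthread_should_stop()) {
		fold_node_vmstat(nid);	/* hypothetical helper */
		schedule_timeout_interruptible(sysctl_stat_interval);
	}
	return 0;
}

/* started once per online node, e.g.: */
/*	kthread_run(vmstat_kthread, (void *)(long)nid, "vmstat/%d", nid); */

That would make the sync interval predictable instead of depending on
timer rounding plus workqueue scheduling, and the threads' affinity
would be visible and tunable from userspace.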
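And just to make the drift itself concrete with the numbers from the
report quoted above (a toy userspace model, not kernel code; the
too_many_isolated check for file pages is roughly isolated > inactive,
with inactive scaled down by 8 for GFP_IOFS callers):

#include <stdio.h>

int main(void)
{
	/* global counters as zone_page_state() reports them */
	long isolated_seen = 14;	/* NR_ISOLATED_FILE */
	long inactive_file = 92;	/* NR_INACTIVE_FILE */
	/* unfolded delta sitting in one CPU's pageset vm_stat_diff */
	long cpu_diff = -14;

	long isolated_real = isolated_seen + cpu_diff;	/* really 0 */
	long threshold = inactive_file >> 3;		/* GFP_IOFS caller */

	printf("seen by reclaim: %ld > %ld -> throttle=%d\n",
	       isolated_seen, threshold, isolated_seen > threshold);
	printf("actual count:    %ld > %ld -> throttle=%d\n",
	       isolated_real, threshold, isolated_real > threshold);
	return 0;
}

So until the -14 gets folded back into the global counter, every
reclaimer doing the same check keeps throttling itself for no real
reason.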
I might be missing obvious things, because I still haven't digested the
code completely, but this whole thing seems overly complicated and I do
not see a good reason for that. If the primary motivation was CPU
isolation then kthreads would sound like a better idea, because you can
play with their explicit affinity from userspace.

> Even in usual cases where the timer triggers in 1-2 secs, is it fine
> to let the tasks in the reclaim path wait that long unnecessarily when
> there isn't any real congestion?

But how do we find that out?
-- 
Michal Hocko
SUSE Labs