This patch series implements a fine-grained metric for memory health. It builds on top of the refault detection code to quantify the time lost on VM events that occur exclusively due to a lack of memory, and maps it into a percentage of lost walltime for the system and cgroups.

Rationale

When presented with a Linux system or container executing a workload, it's hard to judge the health of its memory situation.

The statistics exported by the memory management subsystem can reveal smoking guns: page reclaim activity, major faults and refaults can be indicative of an unhealthy memory situation. But they don't actually quantify the cost a memory shortage imposes on the system or workload.

How bad is it when 2000 pages are refaulting each second? If the data is stored contiguously on a fast flash drive, it might be okay. If the data is spread out all over a rotating disk, it could be a problem - unless the CPUs are still fully utilized, in which case adding memory wouldn't make things move faster; the tasks would simply wait for CPU time instead.

A previous attempt to provide a health signal from the VM was the vmpressure interface, 70ddf637eebe ("memcg: add memory.pressure_level events"). This derives its pressure levels from recently observed reclaim efficiency. As pages are scanned but not reclaimed, the ratio is translated into levels of low, medium, and critical pressure.

However, the vmpressure scale is too coarse for today's systems. The accuracy relies on storage being relatively slow compared to how fast the CPU can go through the LRUs, so that when LRU scan cycles outstrip IO completion rates the reclaim code runs into pages that are still reading from disk. But as solid state devices close this speed gap, and memory sizes are in the hundreds of gigabytes, this effect has almost completely disappeared. By the time the reclaim scanner runs into in-flight pages, the tasks in the system already spend a significant part of their runtime waiting for refaulting pages.
The vmpressure range is compressed into the split second before OOM and misses large, practically relevant parts of the pressure spectrum.

Knowing the exact time penalty that the kernel's paging activity is imposing on a workload is a powerful tool. It allows users to fine-tune a workload to available memory, but also to detect and quantify minute regressions and improvements in the reclaim and caching algorithms.

Structure

The first patch cleans up the different loadavg callsites and macros, as the memdelay averages are going to be tracked using these.

The second patch adds a distinction between page cache transitions (inactive list refaults) and page cache thrashing (active list refaults), since only the latter are unproductive refaults.

The third patch finally adds the memdelay accounting and interface: its scheduler side identifies productive and unproductive task states, and the VM side aggregates them into system and cgroup domain states and calculates moving averages of the time spent in each state.
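
For illustration only, here is a minimal userspace sketch of how a "percentage of lost walltime" could be folded into loadavg-style moving averages. The fixed-point constants mirror the kernel's loadavg definitions (FSHIFT, FIXED_1, EXP_1, EXP_5, EXP_15). Everything else is an assumption and not taken from the patches: the helper names delay_sample() and calc_avg(), the use of nanosecond counters, and the 5-second sampling period that the EXP_* constants imply.

```c
#include <stdint.h>

/* Fixed-point constants, mirroring the kernel's loadavg definitions. */
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1ULL << FSHIFT)   /* 1.0 in fixed point */
#define EXP_1    1884ULL            /* 1/exp(5sec/1min) in fixed point */
#define EXP_5    2014ULL            /* 1/exp(5sec/5min) */
#define EXP_15   2037ULL            /* 1/exp(5sec/15min) */

/*
 * Hypothetical helper (not from the patches): share of a sampling
 * period that was lost to memory delays, as a fixed-point percentage.
 */
static uint64_t delay_sample(uint64_t delayed_ns, uint64_t period_ns)
{
	if (period_ns == 0)
		return 0;
	return delayed_ns * 100 * FIXED_1 / period_ns;
}

/*
 * One decay step of the exponential moving average, same arithmetic
 * as the classic loadavg update:
 *   avg = avg * exp + sample * (1 - exp)
 * The name calc_avg() is an assumption for this sketch.
 */
static uint64_t calc_avg(uint64_t avg, uint64_t exp, uint64_t sample)
{
	avg *= exp;
	avg += sample * (FIXED_1 - exp);
	return avg >> FSHIFT;
}
```

A caller would take one sample per period, for example delay_sample(2500000000ULL, 5000000000ULL) for 2.5s of delay in a 5s window, and feed it to calc_avg() once for each of the three averages with EXP_1, EXP_5 and EXP_15. Dividing the result by FIXED_1 gives the percentage.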

 arch/powerpc/platforms/cell/spufs/sched.c |   3 -
 arch/s390/appldata/appldata_os.c          |   4 -
 drivers/cpuidle/governors/menu.c          |   4 -
 fs/proc/array.c                           |   8 +
 fs/proc/base.c                            |   2 +
 fs/proc/internal.h                        |   2 +
 fs/proc/loadavg.c                         |   3 -
 include/linux/cgroup.h                    |  14 ++
 include/linux/memcontrol.h                |  14 ++
 include/linux/memdelay.h                  | 174 +++++++++++++++++
 include/linux/mmzone.h                    |   1 +
 include/linux/page-flags.h                |   5 +-
 include/linux/sched.h                     |  10 +-
 include/linux/sched/loadavg.h             |   3 +
 include/linux/swap.h                      |   2 +-
 include/trace/events/mmflags.h            |   1 +
 kernel/cgroup/cgroup.c                    |   4 +-
 kernel/debug/kdb/kdb_main.c               |   7 +-
 kernel/fork.c                             |   4 +
 kernel/sched/Makefile                     |   2 +-
 kernel/sched/core.c                       |  20 ++
 kernel/sched/memdelay.c                   | 112 +++++++++++
 mm/Makefile                               |   2 +-
 mm/compaction.c                           |   4 +
 mm/filemap.c                              |  18 +-
 mm/huge_memory.c                          |   1 +
 mm/memcontrol.c                           |  25 +++
 mm/memdelay.c                             | 289 ++++++++++++++++++++++++++++
 mm/migrate.c                              |   2 +
 mm/page_alloc.c                           |  11 +-
 mm/swap_state.c                           |   1 +
 mm/vmscan.c                               |  10 +
 mm/vmstat.c                               |   1 +
 mm/workingset.c                           |  98 ++++++----
 34 files changed, 792 insertions(+), 69 deletions(-)