Hello, It's bothering that we don't have a good mechanism to detect and expose memory pressure and it doesn't seem to be for want of trying. I've been thinking about it for several days and would like to find out whether it makes sense. Not being a mm person, I'm likely mistaken in a lot of details, if not the core concept. Please point out whenever I wander into the lala land. 1. Background AFAIK, there currently are two metrics in use. One is scan ratio - how many pages are being scanned for reclaim per unit time. This is what paces the different reclaimers. While it is related to memory pressure, it involves enough of other factors to be useful as a measure of pressure - e.g. a high-bandwidth streaming workload would cause high scan ratio but can't be said to be under memory pressure. The other metric measures how much reclaimer is struggling - the ratio of the number of pages which couldn't be reclaimed against total scanned. As memory requirement rises, more and more would try to allocate memory making reclaimer cycle faster and encounter more pages which can't be reclaimed yet. This actually is a measure of pressure and what's exposed by vmpressure. However, what a specific number means is unclear. The system would be struggling if 95% of pages couldn't be reclaimed but what would 20% mean? How about 15%? Also, depending on how often the reclaimer does its rounds, which in large part is dependent on implementation details, the resulting number can be widely different while the system is showing about the same behavior overall. These are why vmpressure is exposed as a three-level state but even with such coarse scale, the semantics isn't clear and too much is left to implementation details. There also have been other attempts. Vladimir recently posted a patchset which allows userland to monitor physical page usage - userland can clear used state on pages and then read it back after a while so that it can tell which pages have been used inbetween. While this allows detecting unused pages, this doesn't represent memory pressure. If we go back to the streamer example above, high-enough bandwidth streamer would happily consume all the memory that's given to it and looking at used bits won't tell much. 2. Memory pressure and sizing problem Memory pressure sounds intuitive but isn't actually that well defined. It's closely tied to answering the sizing problem - "how much memory does a given workload need to run smoothly?" and answering that has become more important with cgroup and containerization. Sizing is inherently tricky because mm only tracks a small portion of all memory accesses. We can't tell "this process referenced this page twice, 2mins and 18secs ago". Deciding whether a page is actively used costs and we only track enough information to make reclaim work reasonably. Consequently, it's impossible to tell the working set size of a workload without subjecting it to memory reclaim. Once a workload is subject to memory reclaim, we need a way to tell whether it needs more memory and that measure can be called memory pressure - a workload is under memory pressure if it needs more memory to execute smoothly. More precisely, I think it can be defined as the following. 3. Memory pressure In a given amount of time, the proportion of time that the execution duration of the workload has been lengthened due to lack of memory. I used "lengthened" instead of "delayed" because delays don't necessarily indicate that the workload can benefit from more memory. Consider the following execution pattern where '_' is idle and '+' is running. Let's assume that each CPU burst is caused by receiving a request over network. W0: ++____++____++____ Let's say the workload now needs to fault in some memory during each CPU burst - represented by 'm'. W1: +mm+__+mm+__+mm+__ While delays due to memory shortage have occurred, the duration of the execution stayed the same. In terms of amount of work done, the workload wouldn't have benefited from more memory. Now, consider the following workload where 'i' represents direct IO from filesystem. W2: ++iiii++iiii++iiii If the workload experiences the same page faults, W3: +mm+iiii+mm+iiii+mm+iiii The entire workload has been lengthened by 6 slots or 25%. According to the above definition, it's under 25% memory pressure. This is a well-defined metric which doesn't depend on implementation details and enables immediate intuitive understanding of the current state. 4. Inherent limitations It'd be great if we can calculate the proportion exactly; unfortunately, we can't because there are things we don't know about the workload. Let's go back to W2. Let's say it now accesses more memory and thus causes extra faults before each CPU burst. W4: mm+mm+mm+mm+mm+mm+ It still hasn't lost any amount of work; however, we can no longer tell that whether the pattern without faults would be W0 or W5: ++++++++++++++++++ Another issue is that W0's idle periods may not be collapsible. For example, if the idle periods are latencies from executing RPC calls, injecting page faults would make it look like the following but it'd be difficult to tell the nature of the idle periods from outside. W6: mm++____mm++____mm++____ So, there are inherent limitations in what can be determined; however, even with the above inaccuracies, the proportion should be able to function as a pretty good indicator of whether the workload is likely to benefit from more memory. Also, the incollapsible idle issue is likely to be much less of a problem for larger scalable workloads. 5. Details We need to make more concessions to make it implementable. I probably missed quite a bit but here are what I've thought about. 5-1. Calculable definition Running average of the proportion of CPU time a workload couldn't consume due to memory shortage in a given time unit. A workload, whether a thread or multiple processes, can be in one of the following states on a CPU. R1) READY or RUNNING R2) IO M1) READY or RUNNING due to memory shortage M2) IO due to memory shortage I ) IDLE R1 and R2 count as the execution time. M1 and M2 are CPU time which couldn't be spent due to memory shortage which gets reset by I and the pressure P in a given time unit is calculated as R = R1 + R2 M = M1 + M2 after the last I P = M / (R + M) 5-2. Telling Ms from Rs If a task is in one of the following states, it is in a M state. * Performing or waiting for memory reclaim. * Reading back pages from swap. * Faulting in pages which have been faulted before. The first two are clear. We can use refault distance for the last. I've flipped on this several times now but currently think that we want to set maximum refault distance beyond which a fault is considered new - e.g. if refault distance is larger than 4x currently available memory, count it as R2 instead of M2. There are several things to consider. * There are practical limits to tracking refault distance. Once the inode itself is reclaimed, we lose track of it. * It likely isn't useful to indicate that more memory would be beneficial for a workload which requires orders of magnitude more memory than available. That workload is either infeasible or simply IO bound. * The larger refault distance gets, the less precise its meaning becomes. A page at 8x distance may really need 8x memory to stay in memory or just 10% more. This does create a boundary artifact although I doubt it's likely to be a problem in practice. If anyone has a better idea, please be my guest. If a workload comprises multiple tasks, if any task is in a R state on a CPU, the workload is in R state on that CPU. If none is in a R state and more than one are in a M, M. If all are in I, I. 5-3. IDLE Theoretically, we can consider all TASK_INTERRUPTIBLE sleeps as IDLE points and clear the accumulated M durations in that time unit; however, ignoring sleeps shorter than certain percentage of the unit time or accumulated R durations is likely to lead to better and more forgiving behavior while not really affecting the effectiveness of the metric. So, what do you think? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>