[...] Hi,

This is an interesting topic for me, so I would like to join the conversation. I will be glad to help here, either by testing PSI or by verifying some scenarios and observations.

I have some experience working with low-memory embedded devices: RAM as low as 128MB or 256MB, mostly less than 1GB, with or without display/DRM/graphics support, and with ZRAM as swap space configured at 25% of RAM size. The eMMC storage is also small, 4GB or 8GB at most. So I have run into sluggishness, hangs, and OOM-kill issues quite a number of times, and I would like to share my experience and observations here. Recently I have been exploring the PSI feature in my ARM QEMU/BeagleBone environment, so I can share some feedback on that as well.

System sluggishness can result from 4 types of pressure (especially on smartphone devices):
* memory allocation pressure
* I/O pressure
* scheduling pressure
* network pressure

I think the topic of concern here is memory pressure, so I would like to share some thoughts about it:

* In my opinion, memory pressure should be internal to the system and not visible to end users.
* The pressure metrics can vary from system to system, so it is difficult to apply a single policy.
* I guess this is the time to apply "Machine Learning" and "Artificial Intelligence" to the system :)
* Memory pressure starts with how often and how quickly the system enters the allocation slow path. Monitoring the slow path can therefore give some clue about pressure building up in the system, which is why I used a slow-path counter. Too many slow-path entries right from the start indicate that the system needs to be re-designed. (A rough sketch of such counter-based monitoring is appended at the end of this mail.)
* The system should be prevented from entering the slow path again and again, thus avoiding pressure. If this does happen, it is time to reclaim memory in large chunks rather than small ones. Maybe it is time to think about a shrink_all_memory() knob in the kernel. It could run as bottom-half processing, possibly from cgroups.
* Some experiments were done in the past. Interested people can check this paper:
  http://events17.linuxfoundation.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf
* The system already behaves sluggishly even before it reaches the oom-kill stage. Most of the time the OOM stage is skipped, never reached, or the system just loops around it. So some kind of OOM monitoring may help to gather evidence. That is the reason I proposed something called an oom-stall counter: the system is entering the OOM path, but not necessarily oom-killing. If this counter is updating, we assume the system has started to behave sluggishly.
* An oom-kill counter can also help in determining how much killing is happening in kernel space; for example, if PSI pressure is building up but this counter is not updating. In any case, system daemons should be protected from being killed.
* Some killing policy should be left to user space, so a standard system daemon (or kthread) should be designed along these lines. It should be configured dynamically based on the system and the oom-score. From my previous experience, in Tizen we used something called the resourced daemon:
  https://git.tizen.org/cgit/platform/core/system/resourced/tree/src/memory?h=tizen
* Instead of a static policy there should be something like a "Dynamic Low Memory Manager" (DLMM) policy, where some action can be taken at every stage (slow path, swapping, compaction failure, reclaim failure, OOM). Earlier this event was triggered using vmpressure, but now it could be replaced with PSI.
* Another major culprit for long-term sluggishness is system daemons occupying all of the swap space and never releasing it. Even if applications are killed due to OOM, it may not help much, since daemons will never be killed. So I proposed something called "Dynamic Swappiness", where the swappiness of daemons can be lowered dynamically while normal applications keep higher values. I have done several experiments on this in the past and will be publishing a paper on it soon. (A minimal sketch of the idea is appended at the end of this mail.)
* It may help our understanding to start from a very minimal setup (just 64MB to 512MB RAM) with busybox. If we can tune that perfectly, then larger-scale systems should automatically have no issues.

With respect to PSI, here are my observations:
* The PSI memory averaging windows (10s, 60s, 300s) are too long for an embedded system. I think these settings should be dynamic or user-configurable, or there should be one more entry for 1s or less.
* The PSI memory values are updated only after the in-kernel oom-kill has already happened, which means the sluggishness has already occurred. So I have to use the "total" field and monitor the difference manually: if the difference between the previous total and the next total is more than 100ms and rising, we suspect OOM. (A rough sketch of this monitor is also appended below.)
* Currently, PSI values are system-wide. That is, once sluggishness has occurred, it is difficult to tell which task caused it. So I was thinking of adding a new entry to capture task details as well.

These are some of my opinions. They may or may not be directly applicable; further brainstorming or discussion might be required.

Regards,
Pintu
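
---
Appending the rough sketches mentioned above. These are only illustrative and not tested on the exact setups discussed in this thread.

First, the counter-based monitoring idea. This sketch assumes /proc/vmstat exposes the allocstall* counters (direct-reclaim stalls, used here only as a rough proxy for slow-path entries) and the oom_kill counter; exact field names differ between kernel versions, so treat them as assumptions.

/*
 * Sketch: watch direct-reclaim stalls and oom kills from /proc/vmstat
 * as rough slow-path / oom-kill counters.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long read_counter(const char *prefix)
{
	FILE *fp = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long val, sum = 0;

	if (!fp)
		return 0;
	while (fscanf(fp, "%63s %lu", name, &val) == 2) {
		/* sum allocstall_dma/normal/movable/..., or match oom_kill */
		if (!strncmp(name, prefix, strlen(prefix)))
			sum += val;
	}
	fclose(fp);
	return sum;
}

int main(void)
{
	unsigned long prev_stall = read_counter("allocstall");
	unsigned long prev_oom   = read_counter("oom_kill");

	for (;;) {
		sleep(1);
		unsigned long stall = read_counter("allocstall");
		unsigned long oom   = read_counter("oom_kill");

		/* Rising deltas: the system keeps entering reclaim/OOM paths. */
		printf("slow-path (direct reclaim) +%lu, oom_kill +%lu\n",
		       stall - prev_stall, oom - prev_oom);
		prev_stall = stall;
		prev_oom   = oom;
	}
	return 0;
}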
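
Next, the "total" field monitor I described in the PSI observations. It reads /proc/pressure/memory, tracks how much the "some" total (in microseconds) grew during the last interval, and flags the system as suspect once the growth exceeds a threshold (100ms per second here; the threshold and interval should really be configurable per system).

/*
 * Sketch: poll the "some ... total=" field of /proc/pressure/memory and
 * suspect trouble once the stall time grows by more than 100ms per
 * polling interval.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define THRESHOLD_US	(100 * 1000)	/* 100 ms of stall per interval */
#define INTERVAL_SEC	1

static unsigned long long read_some_total(void)
{
	FILE *fp = fopen("/proc/pressure/memory", "r");
	char line[256];
	unsigned long long total = 0;

	if (!fp)
		return 0;
	while (fgets(line, sizeof(line), fp)) {
		if (!strncmp(line, "some", 4)) {
			char *p = strstr(line, "total=");
			if (p)
				sscanf(p, "total=%llu", &total);
			break;
		}
	}
	fclose(fp);
	return total;
}

int main(void)
{
	unsigned long long prev = read_some_total();

	for (;;) {
		sleep(INTERVAL_SEC);
		unsigned long long cur = read_some_total();
		unsigned long long delta = cur - prev;

		if (delta > THRESHOLD_US)
			printf("memory pressure suspect: stalled %llu us in last %ds\n",
			       delta, INTERVAL_SEC);
		prev = cur;
	}
	return 0;
}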
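
Finally, a minimal sketch of the "Dynamic Swappiness" idea, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory and two hypothetical groups, "daemons" and "apps" (names made up for illustration). When the pressure monitor above reports rising stall time, daemons get a low swappiness so they stop occupying swap, while normal applications keep a higher value.

/*
 * Sketch: adjust per-cgroup swappiness dynamically based on pressure.
 * Paths, group names and values are illustrative only.
 */
#include <stdio.h>

static int set_swappiness(const char *cgroup, int value)
{
	char path[256];
	FILE *fp;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/%s/memory.swappiness", cgroup);
	fp = fopen(path, "w");
	if (!fp)
		return -1;
	fprintf(fp, "%d\n", value);
	fclose(fp);
	return 0;
}

/* Call this from the pressure monitor when stall time starts rising. */
static void on_memory_pressure(int high)
{
	if (high) {
		set_swappiness("daemons", 10);	/* keep daemons off swap */
		set_swappiness("apps", 100);	/* push apps to swap first */
	} else {
		set_swappiness("daemons", 60);	/* back to defaults */
		set_swappiness("apps", 60);
	}
}

int main(void)
{
	/* Example: react to high pressure once; a real daemon would loop. */
	on_memory_pressure(1);
	return 0;
}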