On 2/22/19 6:58 PM, Andrey Ryabinin wrote:
> In a presence of more than 1 memory cgroup in the system our reclaim
> logic is just suck. When we hit memory limit (global or a limit on
> cgroup with subgroups) we reclaim some memory from all cgroups.
> This is sucks because, the cgroup that allocates more often always wins.
> E.g. job that allocates a lot of clean rarely used page cache will push
> out of memory other jobs with active relatively small all in memory
> working set.
>
> To prevent such situations we have memcg controls like low/max, etc which
> are supposed to protect jobs or limit them so they to not hurt others.
> But memory cgroups are very hard to configure right because it requires
> precise knowledge of the workload which may vary during the execution.
> E.g. setting memory limit means that job won't be able to use all memory
> in the system for page cache even if the rest the system is idle.
> Basically our current scheme requires to configure every single cgroup
> in the system.
>
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages and only from cgroups that have big
> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> only if all inactive lists are low.

Perhaps going this direction could also make page cache side-channel
attacks harder? Quoting [1]:

"On Linux, we are only able to evict pages efficiently because we can
trick the page replacement algorithm into believing our target page would
be the best choice for eviction. The reason for this lies in the fact that
Linux uses a global page replacement algorithm, i.e., an algorithm which
does not distinguish between different processes. Global page replacement
algorithms have been known for decades to allow one process to perform a
denial-of-service on other processes"

[1] https://arxiv.org/abs/1901.01161
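
For readers following along, the policy described in the quoted patch text (reclaim only inactive pages from cgroups whose inactive list is not low, and touch active lists only when every inactive list is low) can be sketched as a small userspace toy model. This is not the patch's code and not kernel API: struct toy_cgroup, TOY_INACTIVE_RATIO, toy_inactive_is_low(), toy_take() and toy_reclaim() are all hypothetical names, the fixed ratio is an assumption standing in for the kernel's own inactive_is_low() heuristic, and the fallback pass simply takes pages from both lists instead of modelling deactivation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one cgroup's LRU lists; sizes are in pages. */
struct toy_cgroup {
	unsigned long inactive;
	unsigned long active;
};

/* Assumed fixed ratio, a stand-in for whatever heuristic the kernel uses. */
#define TOY_INACTIVE_RATIO 1UL

/* "Low" here means the inactive list is small relative to the active list. */
static bool toy_inactive_is_low(const struct toy_cgroup *cg)
{
	return cg->inactive * TOY_INACTIVE_RATIO < cg->active;
}

/* Remove up to 'want' pages from '*list'; return how many were removed. */
static unsigned long toy_take(unsigned long *list, unsigned long want)
{
	unsigned long got = *list < want ? *list : want;

	*list -= got;
	return got;
}

/* Try to reclaim 'want' pages across 'n' cgroups; return pages reclaimed. */
static unsigned long toy_reclaim(struct toy_cgroup *cgs, size_t n,
				 unsigned long want)
{
	unsigned long reclaimed = 0;
	bool all_low = true;
	size_t i;

	assert(cgs != NULL || n == 0);

	/* Pass 1: inactive pages only, and only where the list is not low. */
	for (i = 0; i < n; i++) {
		if (toy_inactive_is_low(&cgs[i]))
			continue;
		all_low = false;
		reclaimed += toy_take(&cgs[i].inactive, want - reclaimed);
	}
	if (!all_low)
		return reclaimed;

	/*
	 * Pass 2 (fallback): every inactive list is low, so shrink active
	 * lists as well. Simplification: pages are taken directly instead
	 * of first being moved from the active to the inactive list.
	 */
	for (i = 0; i < n && reclaimed < want; i++) {
		reclaimed += toy_take(&cgs[i].inactive, want - reclaimed);
		reclaimed += toy_take(&cgs[i].active, want - reclaimed);
	}
	return reclaimed;
}
```

In this model, a cgroup with a small, mostly active working set is skipped entirely as long as some other cgroup still has a large inactive list, which is the protection the patch description is after. Whether that also makes the eviction step of the attack in [1] meaningfully harder is the open question above; the sketch does not answer it.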