On Thu 14-05-20 16:21:30, Johannes Weiner wrote:
> On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:
> > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> > > > > > > Use swap.high when deciding if swap is full.
> > > > > >
> > > > > > Please be more specific why.
> > > > >
> > > > > How about:
> > > > >
> > > > > Use swap.high when deciding if swap is full to influence ongoing
> > > > > swap reclaim in a best effort manner.
> > > >
> > > > This is still way too vague. The crux is why should we treat hard and
> > > > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > > > note that I am not saying this is wrong. I am asking for a more
> > > > detailed explanation mostly because I would bet that somebody
> > > > stumbles over this sooner or later.
> > >
> > > Stumbles in what way?
> >
> > Reading the code and trying to understand why this particular decision
> > has been made. Because it might be surprising that the hard and high
> > limits are treated the same here.
>
> I don't quite understand the controversy.

I do not think there is any controversy. All I am asking for is a
clarification, because this is non-intuitive.

> The idea behind "swap full" is that as long as the workload has plenty
> of swap space available and it's not changing its memory contents, it
> makes sense to generously hold on to copies of data in the swap
> device, even after the swapin. A later reclaim cycle can drop the page
> without any IO. Trading disk space for IO.
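A minimal C sketch of the policy described above, under a simplified model. It assumes the global check uses the 50% threshold of vm_swap_full() and that the per-cgroup check walks up the hierarchy, comparing usage against half of both swap.high and swap.max, which is what the patch under discussion proposes for mem_cgroup_swap_full(). The struct `cg_swap`, the macro `SWAP_UNLIMITED` and all three functions are hypothetical illustrations, not kernel API.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "no limit configured" value for this sketch. */
#define SWAP_UNLIMITED ULONG_MAX

/* Hypothetical per-cgroup swap accounting, in pages. */
struct cg_swap {
	unsigned long usage;          /* swap slots charged to this cgroup */
	unsigned long high;           /* models swap.high */
	unsigned long max;            /* models swap.max */
	const struct cg_swap *parent; /* NULL at the top of the hierarchy */
};

/*
 * Global check: swap counts as "getting full" once fewer than half of
 * the slots are free (the 50% threshold assumed for vm_swap_full()).
 */
static bool global_swap_half_full(unsigned long free_slots,
				  unsigned long total_slots)
{
	return free_slots * 2 < total_slots;
}

/*
 * Per-cgroup check: walk up the hierarchy and report "getting full" if
 * usage has reached half of either limit at any level. swap.high and
 * swap.max are treated the same: both cap usable swap space.
 */
static bool cg_swap_half_full(const struct cg_swap *cg)
{
	for (; cg != NULL; cg = cg->parent) {
		if (cg->usage * 2 >= cg->high || cg->usage * 2 >= cg->max)
			return true;
	}
	return false;
}

/*
 * Swapin decision: if swap is getting full, drop the now-redundant
 * swap copy so the slot can be reused. Otherwise keep it, so that if
 * the page stays unmodified a later reclaim can evict it without
 * writing it out again.
 */
static bool free_swap_copy_on_swapin(unsigned long free_slots,
				     unsigned long total_slots,
				     const struct cg_swap *cg)
{
	return global_swap_half_full(free_slots, total_slots) ||
	       cg_swap_half_full(cg);
}
```

For example, a cgroup charged 60 pages with swap.high = 100 and swap.max = 1000 gets its swap copies dropped on swapin (120 >= 100), even when 800 of 1000 global slots are still free. If only swap.max were checked, those redundant copies would keep counting against the cgroup's swap usage until swap.max or the global threshold came into play.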
>
> But the only two ways to reclaim a swap slot are when they're faulted
> in and the references go away, or by scanning the virtual address space
> like swapoff does - which is very expensive (one could argue it's too
> expensive even for swapoff, it's often more practical to just reboot).
>
> So at some point in the fill level, we have to start freeing up swap
> slots on fault/swapin. Otherwise we could eventually run out of swap
> slots while they're filled with copies of data that is also in RAM.
>
> We don't want to OOM a workload because its available swap space is
> filled with redundant cache.

Thanks, this is a useful summary.

> That applies to physical swap limits, swap.max, and naturally also to
> swap.high which is a limit to implement userspace OOM for swap space
> exhaustion.
>
> > > Isn't it expected for the kernel to take reasonable precautions to
> > > avoid hitting limits?
> >
> > Isn't the throttling itself the precaution? How do the swap cache
> > and its control via mem_cgroup_swap_full interact here? See? This is
> > what I am asking to have explained in the changelog.
>
> It sounds like we need better documentation of what vm_swap_full() and
> friends are there for. It should have been obvious why swap.high - a
> limit on available swap space - hooks into it.

Agreed. The primary source of the confusion is the naming here.
vm_swap_full doesn't really say that swap is full. It merely says that
swap is getting full, and so duplicated data should be dropped.

-- 
Michal Hocko
SUSE Labs