On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 22-09-20 08:54:25, Shakeel Butt wrote:
> > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote:
> [...]
> > > > Let me add one more point. Even if the high limit reclaim is swift, it
> > > > can still take 100s of usecs. Most of our jobs are anon-only and we
> > > > use zswap. Compressing a page can take a couple usec, so 100s of usecs
> > > > in limit reclaim is normal. For latency sensitive jobs, this amount of
> > > > hiccups does matter.
> > >
> > > Understood. But isn't this an implementation detail of zswap? Can it
> > > offload some of the heavy lifting to a different context and reduce the
> > > general overhead?
> > >
> >
> > Are you saying doing the compression asynchronously? Similar to how
> > the disk-based swap triggers the writeback and puts the page back to
> > LRU, so the next time reclaim sees it, it will be instantly reclaimed?
> > Or send the batch of pages to be compressed to a different CPU and
> > wait for the completion?
>
> Yes.
>

Adding Minchan, if he has more experience/opinion on async swap on zram/zswap.

> [...]
>
> > > You are right that misconfigured limits can result in problems. But such
> > > a configuration should be quite easy to spot which is not the case for
> > > targeted reclaim calls which do not leave any footprints behind.
> > > Existing interfaces are trying to not expose internal implementation
> > > details as much as well. You are proposing a very targeted interface to
> > > fine control the memory reclaim. There is a risk that userspace will
> > > start depending on a specific reclaim implementation/behavior and future
> > > changes would be prone to regressions in workloads relying on that. So
> > > effectively, any user space memory reclaimer would need to be tuned to a
> > > specific implementation of the memory reclaim.
> >
> > I don't see the exposure of internal memory reclaim implementation.
> > The interface is very simple. Reclaim a given amount of memory. Either
> > the kernel will reclaim less memory or it will over reclaim. In case
> > of reclaiming less memory, the user space can retry given there is
> > enough reclaimable memory. For the over reclaim case, the user space
> > will backoff for a longer time. How are the internal reclaim
> > implementation details exposed?
>
> In an ideal world yes. A feedback mechanism will be independent on the
> particular implementation. But the reality tends to disagree quite
> often. Once we provide a tool there will be users using it to the best
> of their knowledge. Very often as a hammer. This is what the history of
> kernel regressions and "we have to revert an obvious fix because
> userspace depends on an undocumented behavior which happened to work for
> some time" has taught us in a hard way.
>
> I really do not want to deal with reports where a new heuristic in the
> memory reclaim will break something just because the reclaim takes
> slightly longer or over/under reclaims differently so the existing
> assumptions break and the overall balancing from userspace breaks.
>
> This might be a shiny exception of course. And please note that I am not
> saying that the interface is completely wrong or unacceptable. I just
> want to be absolutely sure we cannot move forward with the existing API
> space that we have.
>
> So far I have learned that you are primarily working around an
> implementation detail in the zswap which is doing the swapout path
> directly in the pageout path.

Wait, how did you reach this conclusion? I have explicitly said that we
are not using uswapd-like functionality in production. We are using
this interface for proactive reclaim, and proactive reclaim is not a
workaround for an implementation detail in zswap.

> That sounds like a very bad reason to add
> a new interface.
> You are right that there are likely other usecases to
> like this new interface - mostly to emulate drop_caches - but I believe
> those are quite misguided as well and we should work harder to help
> them out to use the existing APIs.

I do not really understand your concern specific to the new API.
All of your concerns (user expectation of reclaim time or over/under
reclaim) are still possible with the existing API, i.e. memory.high.

> Last but not least the memcg
> background reclaim is something that should be possible without a new
> interface.

So, it comes down to adding more functionality/semantics to
memory.high or introducing a new simple interface. I am fine with
either one, but IMO a convoluted memory.high might have a higher
maintenance cost.

I can send the patch to add the functionality to memory.high, but I
would like to get Johannes's opinion first.

Shakeel
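
For illustration only, the retry/backoff behaviour described in the quoted exchange (retry when the kernel reclaims less than requested, back off longer when it over-reclaims) could be sketched roughly as below. This is not code from the patch under discussion: the function name, the injected `reclaim` callable, `base_sleep`, `max_retries`, and the proportional backoff rule are all hypothetical choices made for the sketch, and the actual kernel interface is deliberately left abstract.

```python
from typing import Callable, Tuple


def proactive_reclaim_step(
    target: int,
    reclaim: Callable[[int], int],
    base_sleep: float = 1.0,
    max_retries: int = 3,
) -> Tuple[int, float]:
    """Try to reclaim `target` bytes once, with bounded retries.

    `reclaim(n)` is a hypothetical stand-in for whatever interface asks
    the kernel to reclaim n bytes; it returns the bytes actually reclaimed.
    Returns (total bytes reclaimed, seconds to wait before the next step).
    """
    if target <= 0:
        raise ValueError("target must be positive")

    total = 0
    for _ in range(max_retries):
        remaining = target - total
        if remaining <= 0:
            break  # target met (or exceeded) already
        got = reclaim(remaining)
        if got <= 0:
            break  # nothing reclaimable right now; do not spin
        total += got

    if total > target:
        # Over-reclaim: back off for proportionally longer (assumed policy).
        sleep = base_sleep * total / target
    else:
        # Under-reclaim or exact: keep the normal interval.
        sleep = base_sleep
    return total, sleep
```

Note that the loop depends only on how many bytes were reclaimed, not on how long the reclaim took, which is the property the "simple interface" argument relies on; the concern raised in the thread is that real users may come to depend on more than that.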