Hi Johannes,

On Fri, Dec 8, 2023 at 8:35 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > > Are we still on board to have this two patch merge together somehow so
> > > > we can have
> > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > > >
> > > > Thanks
> > > >
> > > > Chris
> > > >
> > >
> > > Hi Chris,
> > >
> > > I briefly summarized my recent discussion with Johannes here:
> > >
> > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@xxxxxxxxxxxxxx/
> > >
> > > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > > interface, but the use case is not quite there yet, so it does not
> > > make too much sense to build up that heavy machinery now.
> > > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > > if we do decide to implement it.
> >
> > I am honestly not convinced by this. There is no heavy machinery here.
> > The interface is more generic and extensible, but the implementation
> > is roughly the same. Unless we have a reason to think a swap.tiers
> > interface may make it difficult to extend this later or will not
> > support some use cases, I think we should go ahead with it. If we are
> > worried that "tiers" may not accurately describe future use cases, we
> > can be more generic and call it swap.types or something.
>
> I have to disagree. The generic swap types or tiers ideas actually
> look pretty far-fetched to me, and there is a lack of convincing
> explanation for why this is even a probable direction for swap.

It boils down to there being more than just "zswap + SSD" use cases in
other parts of the Linux community. The need is real; the only question
is how to get there.

>
> For example,
>
> 1. What are the other backends? Where you seem to see a multitude of
>    backends and arbitrary hierarchies of them, I see compression and
>    flash, and really not much else. And there is only one reasonable
>    direction in which to combine those two.

I listed a few other use cases in an earlier email in this thread:

https://lore.kernel.org/linux-mm/CAF8kJuNpnqTM5x1QmQ7h-FaRWVnHBdNGvGvB3txohSOmZhYA-Q@xxxxxxxxxxxxxx/T/#t

TL;DR:
1) Google has had an internal memory.swapfile in production for almost
   10 years.
2) Tencent uses hard drives as SSD swap overflow. +Kairui.
   https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@xxxxxxxxx/
3) Android has more fancy swap usage. +Minchan
   https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@xxxxxxxxxx/

That you can't imagine such a usage is not a reason to block others
from it. Please respect other use cases as well.

As for the other backends, the first minimal milestone we can implement
is "all" and "zswap", which is functionally equivalent to
memory.zswap.writeback. The other common backends are SSD and hard
drive. The exact keywords and tiers are up for future discussion. I
just want to acknowledge that the need is there.

>
>    The IOPs and latencies of HDDs and network compared to modern
>    memory sizes and compute speeds make them for the most part
>    impractical as paging backends.

I don't see that being a problem when they serve as lower tiers for
very cold swap. Again, it might not be a use case for Meta, but please
respect the use cases of others.

>
>    So I don't see a common third swap backend, let alone a fourth or a
>    fifth, or a multitude of meaningful ways of combining them...

That does not stop others from wanting to use them.

>
> 2. Even if the usecases were there, enabling this would be a ton of
>    work and open interface questions:

If we can agree that this interface is more flexible and covers more
use cases than "zswap.writeback", we can start from the minimal
implementation of "zswap" and "all". There is not a ton of work in the
first minimal milestone. I sent out the minimal patch here:

https://lore.kernel.org/linux-mm/ZVrHXJLxvs4_CUxc@xxxxxxxxxx/

>
>    1) There is no generic code to transfer pages between arbitrary
>       backends.

True, but it does not have to be there to make "swap.tiers" useful. It
can start without transferring pages between backends. Initially it is
just a more flexible way to specify which swap the cgroup wants to opt
into, and that is a solid need. Everything else can be done later.

>
>    2) There is no accepted indirection model where a swap pte can refer
>       to backends dynamically, in a way that makes migration feasible
>       at scale.

Same as above.

>
>    3) Arbitrary global strings are somewhat unlikely to be accepted as
>       a way to configure these hierarchies.

It does not need to be an arbitrary string. We already use a string to
configure cgroup.subtree_control, for example. I am not saying we should
just borrow the subtree control syntax, but we can do something similar.

>
>    4) Backend file paths in a global sysfs file don't work well with
>       namespacing. The swapfile could be in a container
>       namespace. Containers are not guaranteed to see /sys.

Can you clarify what usage problem you are trying to solve? Containers
are typically managed by the container manager's service, and that
service sees /sys for sure. Do you want a cgroup interface that allows
the container to self-service the swap file?

>
>    5) Fixed keywords like "zswap" might not be good enough - what about
>       compression and backend parameters?

I think those levels of detail are zswap/zpool specific; they belong
under /sys/kernel/mm/zswap/foo_bar, for example. We can discuss the
detailed usage, nothing is set in stone yet.
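
To make the first milestone concrete, here is a rough sketch of how a
fixed keyword table for "swap.tiers" could map onto today's
memory.zswap.writeback values. Everything in it is an assumption for
illustration only: "swap.tiers" is a proposed interface, not an existing
cgroup file, and the Tier names, parse_swap_tiers() and
equivalent_zswap_writeback() are hypothetical helpers, not kernel code
and not what the minimal patch does.

```python
# Illustrative sketch only. "swap.tiers" is a proposed interface; the
# keyword set and helper names below are hypothetical.
from enum import Flag


class Tier(Flag):
    ZSWAP = 1      # compressed in-memory tier
    SWAPFILE = 2   # any backing swap device or file (SSD, HDD, ...)


# Fixed keyword table for the first milestone: only "zswap" and "all".
# New keywords would be added here later, so the string is not arbitrary.
KEYWORDS = {
    "zswap": Tier.ZSWAP,
    "all": Tier.ZSWAP | Tier.SWAPFILE,
}


def parse_swap_tiers(value: str) -> Tier:
    """Parse a swap.tiers-style keyword into a tier mask; reject unknown keywords."""
    keyword = value.strip()
    if keyword not in KEYWORDS:
        raise ValueError(f"unknown swap tier keyword: {keyword!r}")
    return KEYWORDS[keyword]


def equivalent_zswap_writeback(tiers: Tier) -> int:
    """Return the memory.zswap.writeback value with the same effect as the mask."""
    # Writeback is only meaningful when a backing swap tier is allowed.
    return 1 if Tier.SWAPFILE in tiers else 0
```

Under these assumptions "all" corresponds to zswap.writeback == 1 and
"zswap" to zswap.writeback == 0, and any other keyword is rejected
rather than silently accepted.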
>
> None of these are insurmountable. My point is that this would be a
> huge amount of prerequisite code and effort for what seems would be a
> fringe usecase at best right now.

For the fringe use case, a minimal patch already exists. It is not a
huge amount of prerequisite code. What you describe is close to the end
goal of swap tiers.

>
> And there could be a lot of curve balls in both the software design as
> well as the hardware development between now and then that could make
> your proposals moot. Is a per-cgroup string file really going to be
> the right way to configure arbitrary hierarchies if they materialize?
>
> This strikes me as premature and speculative, for what could be, some
> day.

"swap.tiers" in my mind is strictly better than "zswap.writeback"
because writeback can only be 0 or 1. It is not able to describe other
swap selections. Setting zswap.writeback = 0 disables SSD swap as well,
which is not very intuitive. If we want SSD-only swap, we need to set
"zswap.writeback" = 1 and "zswap.max" = 0. All this makes me feel that
we are limiting ourselves too much by looking at the rest of the world
from inside the zswap cubicle.

>
> We don't even do it for *internal API*. There is a review rule to
> introduce a function in the same patch as its first caller, to make
> sure it's the right abstraction and a good fit for the usecase. There
> is no way we can have a lower bar than that for permanent ABI.
>
> The patch here integrates with what zswap is NOW and always has been:
> a compressing writeback cache for swap.
>
> Should multiple swap tiers overcome all the above and actually become
> real, this knob here would be the least of our worries. It would be
> easy to just ignore, automatically override, or deprecate.
>
> So I don't think you made a reasonable proposal for an alternative, or
> gave convincing reasons to hold off this one.
>

Keep in mind that the minimal patch is just trying to avoid the detour
of introducing zswap.writeback and then obsoleting it soon after.
There is a solid need for swap backends other than zswap. Frankly
speaking, I don't see writing "all" to "swap.tiers" as that different
from writing "1" to "zswap.writeback". From the API point of view, one
is more flexible and future proof. I just don't want "zswap.writeback"
to become something carved in stone that we can't remove later,
especially when we can have an alternative API that is compatible with
other use cases.

Chris