On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:
> On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > > I can give you three usage cases right now:
> > > 1) The Google production kernel uses SSD-only swap; it is currently
> > > in pilot. This is not expressible with memory.zswap.writeback. You
> > > can set memory.zswap.max = 0 and memory.zswap.writeback = 1, and you
> > > get an SSD-backed swapfile. But the whole thing feels very clunky:
> > > when what you really want is SSD-only swap, you have to do all this
> > > zswap config dance. Google has an internal memory.swapfile feature
> > > that selects the per-cgroup swap file type: "zswap only", "real swap
> > > file only", "both", or "none" (the exact keywords might be
> > > different). It has been running in production for almost 10 years.
> > > The need for more than zswap-only per-cgroup control is really
> > > there.
> >
> > We use regular swap on SSD without zswap just fine. Of course it's
> > expressible.
> >
> > On dedicated systems, zswap is disabled in sysfs. On shared hosts
> > where it's determined based on which workload is scheduled, zswap is
> > generally enabled through sysfs, and individual cgroup access is
> > controlled via memory.zswap.max - which is what this knob is for.
> >
> > This is analogous to enabling swap globally, and then opting
> > individual cgroups in and out with memory.swap.max.
> >
> > So this usecase is very much already supported, and it's expressed in
> > a way that's pretty natural for how cgroups express access, and lack
> > of access, to certain resources.
> >
> > I don't see how memory.swap.type or memory.swap.tiers would improve
> > this in any way. On the contrary, it would overlap and conflict with
> > existing controls to manage swap and zswap on a per-cgroup basis.
> >
> > > 2) As indicated by this discussion, Tencent has a usage case for
> > > SSD and hard disk swap as overflow.
> > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@xxxxxxxxx/
> > > +Kairui
> >
> > Multiple swap devices in round robin or with different priorities
> > aren't new; they have been supported for a very, very long time. So
> > far nobody has proposed to control the exact behavior on a per-cgroup
> > basis, and I didn't see anybody in this thread asking for it either.
> >
> > So I don't see how this counts as an obvious and automatic usecase
> > for memory.swap.tiers.
> >
> > > 3) Android has some fancy swap ideas led by those patches:
> > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@xxxxxxxxxx/
> > > It got shot down due to the removal of frontswap. But the usage
> > > case and product requirement are there.
> > > +Minchan
> >
> > This looks like an optimization for zram to bypass the block layer
> > and hook directly into the swap code. Correct me if I'm wrong, but
> > this doesn't appear to have anything to do with per-cgroup backend
> > control.
>
> Hi Johannes,
>
> I haven't been following the thread closely, but I noticed the
> discussion about potential use cases for zram with memcg.
>
> One interesting idea I have is to implement a swap controller per
> cgroup. This would allow us to tailor the zram swap behavior to the
> specific needs of different groups.
>
> For example, Group A, which is sensitive to swap latency, could use
> zram swap with a fast compression setting, even if it sacrifices some
> compression ratio. This would prioritize quick access to swapped data,
> even if it takes up more space.
>
> On the other hand, Group B, which can tolerate higher swap latency,
> could benefit from a slower compression setting that achieves a higher
> compression ratio. This would maximize memory efficiency at the cost
> of slightly slower data access.
>
> This approach could provide a more nuanced and flexible way to manage
> swap usage within different cgroups.

That makes sense to me.

It sounds to me like per-cgroup swapfiles would be the easiest solution
to this. Then you can create zram devices with different configurations
and assign them to individual cgroups.

This would also apply to Kairui's usecase: assign zram devices and hdd
backups as needed on a per-cgroup basis.

In addition, it would naturally solve scalability and isolation
problems when multiple containers would otherwise be hammering on the
same swap backends and locks.

It would also only require one relatively simple new interface, such as
a cgroup parameter to swapon().

That's highly preferable over a complex configuration file like
memory.swap.tiers, which would need to solve all sorts of visibility
and namespace issues and duplicate the full configuration interface of
every backend in some new, custom syntax.
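
FWIW, the "different configurations" part is already expressible with
the existing zram sysfs knobs; it's only the per-cgroup binding that's
missing. A rough sketch of the device setup, assuming the zram module
is loaded with at least two devices and that lz4 and zstd are both
compiled in (device names and algorithm picks are only examples):

/*
 * Sketch: two zram swap devices with different compression tradeoffs,
 * along the lines of the Group A / Group B example above.  Assumes
 * the zram module is loaded with num_devices >= 2; there is no
 * per-cgroup assignment here because that interface doesn't exist.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s\n", val) < 0 || fclose(f) != 0) {
		perror(path);
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	/* Group A: latency-sensitive - fast compressor, lower ratio. */
	write_attr("/sys/block/zram0/comp_algorithm", "lz4");
	write_attr("/sys/block/zram0/disksize", "2G");

	/* Group B: latency-tolerant - slower compressor, higher ratio. */
	write_attr("/sys/block/zram1/comp_algorithm", "zstd");
	write_attr("/sys/block/zram1/disksize", "2G");

	return 0;
}

mkswap and swapon on those devices work today, but both end up in the
same global swap space; a cgroup argument to swapon() is the piece
that would scope them to A and B.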