Re: [PATCH v6] zswap: memcontrol: implement zswap writeback disabling

Fabian Deutsch <fdeutsch@xxxxxxxxxx> · Thu, 14 Dec 2023 19:00:31 +0100

On Thu, Dec 14, 2023 at 6:24 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

>

> On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:

> > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:

> > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:

> > > > I can give you three usage cases right now:

> > > > 1) Google production kernel uses SSD only swap, it is currently on

> > > > pilot. This is not expressible by the memory.zswap.writeback. You can

> > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD

> > > > backed swapfile. But the whole thing feels very clunky, especially

> > > > what you really want is SSD only swap, you need to do all this zswap

> > > > config dance. Google has an internal memory.swapfile feature

> > > > implemented per cgroup swap file type by "zswap only", "real swap file

> > > > only", "both", "none" (the exact keyword might be different). running

> > > > in the production for almost 10 years. The need for more than zswap

> > > > type of per cgroup control is really there.

> > >

> > > We use regular swap on SSD without zswap just fine. Of course it's

> > > expressible.

> > >

> > > On dedicated systems, zswap is disabled in sysfs. On shared hosts

> > > where it's determined based on which workload is scheduled, zswap is

> > > generally enabled through sysfs, and individual cgroup access is

> > > controlled via memory.zswap.max - which is what this knob is for.

> > >

> > > This is analogous to enabling swap globally, and then opting

> > > individual cgroups in and out with memory.swap.max.

> > >

> > > So this usecase is very much already supported, and it's expressed in

> > > a way that's pretty natural for how cgroups express access and lack of

> > > access to certain resources.

> > >

> > > I don't see how memory.swap.type or memory.swap.tiers would improve

> > > this in any way. On the contrary, it would overlap and conflict with

> > > existing controls to manage swap and zswap on a per-cgroup basis.

> > >

> > > > 2) As indicated by this discussion, Tencent has a usage case for SSD

> > > > and hard disk swap as overflow.

> > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@xxxxxxxxx/

> > > > +Kairui

> > >

> > > Multiple swap devices for round robin or with different priorities

> > > aren't new, they have been supported for a very, very long time. So

> > > far nobody has proposed to control the exact behavior on a per-cgroup

> > > basis, and I didn't see anybody in this thread asking for it either.

> > >

> > > So I don't see how this counts as an obvious and automatic usecase for

> > > memory.swap.tiers.

> > >

> > > > 3) Android has some fancy swap ideas led by those patches.

> > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@xxxxxxxxxx/

> > > > It got shot down due to removal of frontswap. But the usage case and

> > > > product requirement is there.

> > > > +Minchan

> > >

> > > This looks like an optimization for zram to bypass the block layer and

> > > hook directly into the swap code. Correct me if I'm wrong, but this

> > > doesn't appear to have anything to do with per-cgroup backend control.

> >

> > Hi Johannes,

> >

> > I haven't been following the thread closely, but I noticed the discussion

> > about potential use cases for zram with memcg.

> >

> > One interesting idea I have is to implement a swap controller per cgroup.

> > This would allow us to tailor the zram swap behavior to the specific needs of

> > different groups.

> >

> > For example, Group A, which is sensitive to swap latency, could use zram swap

> > with a fast compression setting, even if it sacrifices some compression ratio.

> > This would prioritize quick access to swapped data, even if it takes up more space.

> >

> > On the other hand, Group B, which can tolerate higher swap latency, could benefit

> > from a slower compression setting that achieves a higher compression ratio.

> > This would maximize memory efficiency at the cost of slightly slower data access.

> >

> > This approach could provide a more nuanced and flexible way to manage swap usage

> > within different cgroups.

>

> That makes sense to me.

>

> It sounds to me like per-cgroup swapfiles would be the easiest

> solution to this.

Someone posted it about 10 years ago :)

https://lwn.net/Articles/592923/

+fdeutsch@xxxxxxxxxx

Fabian recently asked me about its status.

Yep - for container use-cases.

Now a few thoughts in this direction:
- With swap per cgroup you lose the big "statistical" benefit of having swap on a node level. Well, it depends on the size of the cgroup (e.g. system.slice is quite large).
- With today's node-level swap, setting memory.swap.max=0 for all cgroups allows you to achieve a similar behavior (only opt-in cgroups will get swap).
- The above approach, however, will still have a shared swap backend for all cgroups.
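
To make the opt-in idea above concrete, here is a rough sketch, not a tested tool. It assumes a cgroup v2 hierarchy and only looks at the direct children of a given root. The helper names (plan_swap_limits, apply_swap_limits, demo) and the slice names are made up for illustration, and the demo runs against a temporary directory instead of a real /sys/fs/cgroup mount.

```python
import tempfile
from pathlib import Path


def plan_swap_limits(cgroup_root, opt_in):
    """Map each direct child cgroup directory to its memory.swap.max value.

    Cgroups named in `opt_in` get "max" (swap allowed); all others get "0".
    """
    plan = {}
    for child in sorted(Path(cgroup_root).iterdir()):
        if not child.is_dir():
            continue  # skip control files that live next to the child cgroups
        plan[child] = "max" if child.name in opt_in else "0"
    return plan


def apply_swap_limits(plan):
    """Write the planned value into each cgroup's memory.swap.max file."""
    for cgroup_dir, value in plan.items():
        (cgroup_dir / "memory.swap.max").write_text(value + "\n")


def demo():
    """Run the sketch against a fake hierarchy in a temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical slice names, for illustration only.
        for name in ("system.slice", "user.slice", "batch.slice"):
            (Path(root) / name).mkdir()
        plan = plan_swap_limits(root, opt_in={"batch.slice"})
        apply_swap_limits(plan)
        return {d.name: (d / "memory.swap.max").read_text().strip() for d in plan}


if __name__ == "__main__":
    print(demo())
```

Note that memory.swap.max is hierarchical, so on a real system an opt-in cgroup only gets swap if none of its ancestors is limited to 0. All opt-in cgroups still share the same node-level swap backend.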