Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Thu, 6 Jun 2024 09:42:21 -0700

On Thu, Jun 6, 2024 at 6:28 AM Erhard Furtner <erhard_f@xxxxxxxxxxx> wrote:
>
> On Wed, 5 Jun 2024 16:58:11 -0700
> Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> > On Wed, Jun 5, 2024 at 4:53 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <erhard_f@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 4 Jun 2024 20:03:27 -0700
> > > > > Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > > > Could you check if the attached patch helps? It basically changes the
> > > > > > number of zpools from 32 to min(32, nr_cpus).
> > > > >
> > > > > Thanks! The patch does not fix the issue but it helps.
> > > > >
> > > > > Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
> > > > >
> > > > > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
> > > >
> > > > Thanks for trying this out. This is interesting, so even two zpools is
> > > > too much fragmentation for your use case.
> > >
> > > Now I'm a little bit skeptical that the problem is due to fragmentation.
> > >
> > > > I think there are multiple ways to go forward here:
> > > > (a) Make the number of zpools a config option, leave the default as
> > > > 32, but allow special use cases to set it to 1 or similar. This is
> > > > probably not preferable because it is not clear to users how to set
> > > > it, but the idea is that no one will have to set it except special use
> > > > cases such as Erhard's (who will want to set it to 1 in this case).
> > > >
> > > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > > approach is that with a large number of CPUs, too many zpools will
> > > > start having diminishing returns. Fragmentation will keep increasing,
> > > > while the scalability/concurrency gains will diminish.
> > > >
> > > > (c) Make the number of zpools scale logarithmically with the number of
> > > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > > of zpools from increasing too much and close to the status quo. The
> > > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > > >
> > > > (d) Make the number of zpools scale linearly with memory. This makes
> > > > more sense than scaling with CPUs because increasing the number of
> > > > zpools increases fragmentation, so it makes sense to limit it by the
> > > > available memory. This is also more consistent with other magic
> > > > numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
> > > >
> > > > The problem is that unlike zswap trees, the zswap pool is not
> > > > connected to the swapfile size, so we don't have an indication for how
> > > > much memory will be in the zswap pool. We can scale the number of
> > > > zpools with the entire memory on the machine during boot, but this
> > > > seems like it would be difficult to figure out, and will not take into
> > > > consideration memory hotplugging and the zswap global limit changing.
> > > >
> > > > (e) A creative mix of the above.
> > > >
> > > > (f) Something else (probably simpler).
> > > >
> > > > I am personally leaning toward (c), but I want to hear the opinions of
> > > > other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
> > >
> > > I double checked that commit and didn't find anything wrong. If we are
> > > all in the mood of getting to the bottom, can we try using only 1
> > > zpool while there are 2 available? I.e.,
> >
> > Erhard, do you mind checking if Yu's diff below to use a single zpool
> > fixes the problem completely? There is also an attached patch that
> > does the same thing if this is easier to apply for you.
>
> No, setting ZSWAP_NR_ZPOOLS to 1 does not fix the problem unfortunately (that being the only patch applied on v6.10-rc2).

This confirms Yu's theory that the zpools fragmentation is not the
main reason for the problem. As Vlastimil said, the setup is already
tight on memory and that commit may have just pushed it over the edge.
Since setting ZSWAP_NR_ZPOOLS to 1 (which effectively reverts the
commit) does not help in v6.10-rc2, then something else that came
after the commit would have pushed it over the edge anyway.

>
> Trying to alter the lowmem and virtual mem limits next as Michael suggested.

I saw that this worked. So it seems like we don't need to worry about
the number of zpools, for now at least :)

Thanks for helping with the testing, and thanks to everyone else who
helped on this thread.

>
> Regards,
> Erhard