On Tue, Jun 4, 2024 at 10:54 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <erhard_f@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > > Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > > possible reason why that commit can cause an issue is because we will
> > > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > > >
> > > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > > >
> > > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > > cache of each type.
> > > > >
> > > > > Thanks for looking into this! I got you the 'cat /proc/slabinfo' output from a good HEAD, from a bad HEAD, and from the bad HEAD + your patch applied.
> > > > >
> > > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17, both of which I got from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB of RAM were exhausted and got the slabinfo then.
> > > > >
> > > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
> > > >
> > > > Thanks for trying this out. The patch reduces the amount of wasted
> > > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > > magnitude, but it was a small number to begin with (~250K).
> > > >
> > > > I cannot think of other reasons why having multiple zsmalloc pools
> > > > will end up using more memory in the 0.25GB zone that the kernel
> > > > allocations can be made from.
> > > >
> > > > The number of zpools can be made configurable or determined at runtime
> > > > by the size of the machine, but I don't want to do this without
> > > > understanding the problem here first. Adding other zswap and zsmalloc
> > > > folks in case they have any ideas.
> > >
> > > Hi Erhard,
> > >
> > > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > > on kernels before and after the bad commit? It'd be great if you could
> > > run the grep command right before the OOM kills.
> > >
> > > The overall internal fragmentation of multiple zsmalloc pools might be
> > > higher than a single one. I suspect this might be the cause.
> >
> > I thought about the internal fragmentation of pools, but zsmalloc
> > should have access to highmem, and if I understand correctly the
> > problem here is that we are running out of space in the DMA zone when
> > making kernel allocations.
> >
> > Do you suspect zsmalloc is allocating memory from the DMA zone
> > initially, even though it has access to highmem?
>
> There was a lot of user memory in the DMA zone. So at some point the
> highmem zone was full and allocation fallback happened.
>
> The problem with zone fallback is that recent allocations go into
> lower zones, meaning they are further back on the LRU list. This
> applies to both user memory and zsmalloc memory -- the latter has a
> writeback LRU. On top of this, neither the zswap shrinker nor the
> zsmalloc shrinker (compaction) is zone aware. So page reclaim might
> have trouble hitting the right target zone.

I see what you mean. In this case, yeah, I think the internal
fragmentation in the zsmalloc pools may be the reason behind the
problem.

How many CPUs does this machine have? I am wondering if 32 is overkill
for small machines; perhaps the number of pools should be
min(nr_cpus, 32)?
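Roughly something like the sketch below -- completely untested, and the
helper name is made up, it is only meant to show the shape of what I
mean. num_possible_cpus() rather than num_online_cpus() so the result
does not depend on when CPUs come up:

#include <linux/cpumask.h>	/* num_possible_cpus() */
#include <linux/minmax.h>	/* min_t() */

/*
 * Hypothetical helper: pick the number of zsmalloc zpools per zswap
 * pool at pool-creation time instead of hard-coding 32, so a small
 * machine (e.g. 2 CPUs) only pays for 2 zpools worth of fragmentation.
 */
static unsigned int zswap_nr_zpools(void)
{
	return min_t(unsigned int, num_possible_cpus(), 32);
}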
Alternatively, the number of pools should scale with the memory size in
some way, such that we only increase fragmentation when it's tolerable
(a rough sketch of that variant is below, after the quoted text).

> We can't really tell how zspages are distributed across zones, but the
> overall number might be helpful. It'd be great if someone could make
> nr_zspages per zone :)
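For the memory-size alternative above, an equally rough sketch -- the
one-zpool-per-GiB ratio and the name are placeholders, not a concrete
proposal:

#include <linux/mm.h>		/* totalram_pages(), PAGE_SHIFT */
#include <linux/minmax.h>	/* clamp_t() */

/*
 * Hypothetical helper: scale the zpool count with the amount of RAM
 * (here one zpool per GiB, clamped to [1, 32]), so we only add
 * fragmentation where there is memory to spare.
 */
static unsigned int zswap_nr_zpools_by_ram(void)
{
	unsigned long gib = totalram_pages() >> (30 - PAGE_SHIFT);

	return clamp_t(unsigned int, gib, 1, 32);
}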