Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Thu, 6 Jun 2024 11:03:13 -0700

On Thu, Jun 6, 2024 at 10:55 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Thu, Jun 6, 2024 at 11:42 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jun 6, 2024 at 10:14 AM Takero Funaki <flintglass@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 6, 2024 at 8:42 Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > > I think there are multiple ways to go forward here:
> > > > (a) Make the number of zpools a config option, leave the default as
> > > > 32, but allow special use cases to set it to 1 or similar. This is
> > > > probably not preferable because it is not clear to users how to set
> > > > it, but the idea is that no one will have to set it except special use
> > > > cases such as Erhard's (who will want to set it to 1 in this case).
> > > >
> > > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > > approach is that with a large number of CPUs, too many zpools will
> > > > start having diminishing returns. Fragmentation will keep increasing,
> > > > while the scalability/concurrency gains will diminish.
> > > >
> > > > (c) Make the number of zpools scale logarithmically with the number of
> > > > CPUs. Maybe something like 4 * log2(nr_cpus). This will keep the number
> > > > of zpools from increasing too much and keep it close to the status quo.
> > > > The problem is that at a small number of CPUs (e.g. 2), 4 * log2(nr_cpus)
> > > > will actually give nr_zpools > nr_cpus. So we will need to come up
> > > > with a fancier magic equation (e.g. 4 * log2(nr_cpus/4)).
> > > >
> > >
> > > I just posted a patch to limit the number of zpools, with some
> > > theoretical background explained in the code comments. I believe that
> > > scaling linearly at 2 zpools per CPU is sufficient to reduce contention,
> > > but the scale can be reduced further. It is unlikely that all CPUs will
> > > try to allocate/free zswap at the same time.
> > > How many concurrent accesses were the original 32 zpools supposed to
> > > handle? I think it was for 16 CPUs or more. Or would nr_cpus/4 be
> > > enough?
> >
> > We use 32 zpools on machines with 100s of CPUs. Two zpools per CPU is
> > overkill imo.
>
> Not to choose a camp; just a friendly note on why I strongly disagree
> with the N zpools per CPU approach:
> 1. It is fundamentally flawed to assume the system is linear;
> 2. Nonlinear systems usually have diminishing returns.
>
> For Google data centers, using nr_cpus as the scaling factor had long
> passed the acceptable ROI threshold. Per-CPU data, especially when
> compounded per memcg or even per process, is probably the number-one
> overhead in terms of DRAM efficiency.

100% agreed. If you look at option (b) above, I specifically called
out that scaling the number of zpools linearly with the number of
CPUs has diminishing returns :)
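
For a rough sense of the numbers behind options (b) and (c) above, here is
a minimal sketch that only does the arithmetic. It is not taken from any
posted patch: the function names, the divisor of 4, the factor of 4, the
floor-style log2, and the clamp to at least one zpool are all assumptions
made for illustration.

```python
def floor_log2(n: int) -> int:
    """Integer floor of log2(n) for n >= 1 (hypothetical helper)."""
    return n.bit_length() - 1


def zpools_linear(nr_cpus: int, divisor: int = 4) -> int:
    """Option (b): nr_cpus / divisor, clamped to at least 1 (assumed clamp)."""
    return max(1, nr_cpus // divisor)


def zpools_log(nr_cpus: int, factor: int = 4) -> int:
    """Option (c): factor * log2(nr_cpus), clamped to at least 1 (assumed clamp)."""
    return max(1, factor * floor_log2(nr_cpus))


if __name__ == "__main__":
    # Print both candidates side by side for a range of CPU counts.
    print(f"{'nr_cpus':>8} {'linear':>8} {'log':>8}")
    for nr_cpus in (1, 2, 4, 8, 16, 64, 256, 1024):
        print(f"{nr_cpus:>8} {zpools_linear(nr_cpus):>8} {zpools_log(nr_cpus):>8}")
```

Under these assumed constants, the logarithmic form gives 4 zpools at 2 CPUs
(more zpools than CPUs, the small-machine problem noted in option (c)) and
32 zpools at 256 CPUs, while nr_cpus/4 gives 64 zpools at 256 CPUs and
keeps growing from there.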