On Tue, Jun 4, 2024 at 10:54 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <erhard_f@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > > Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > > possible reason why that commit can cause an issue is because we will
> > > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > > >
> > > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > > >
> > > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > > cache of each type.
> > > > >
> > > > > Thanks for looking into this! I got you the 'cat /proc/slabinfo' output from a good HEAD, from a bad HEAD, and from the bad HEAD + your patch applied.
> > > > >
> > > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17, both of which I got from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB of RAM were exhausted and got the slabinfo then.
> > > > >
> > > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
> > > >
> > > > Thanks for trying this out. The patch reduces the amount of wasted
> > > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > > magnitude, but it was a small number to begin with (~250K).
> > > >
> > > > I cannot think of other reasons why having multiple zsmalloc pools
> > > > will end up using more memory in the 0.25GB zone that the kernel
> > > > allocations can be made from.
> > > >
> > > > The number of zpools can be made configurable or determined at runtime
> > > > by the size of the machine, but I don't want to do this without
> > > > understanding the problem here first. Adding other zswap and zsmalloc
> > > > folks in case they have any ideas.
> > >
> > > Hi Erhard,
> > >
> > > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > > on kernels before and after the bad commit? It'd be great if you could
> > > run the grep command right before the OOM kills.
> > >
> > > The overall internal fragmentation of multiple zsmalloc pools might be
> > > higher than a single one. I suspect this might be the cause.
> >
> > I thought about the internal fragmentation of pools, but zsmalloc
> > should have access to highmem, and if I understand correctly the
> > problem here is that we are running out of space in the DMA zone when
> > making kernel allocations.
> >
> > Do you suspect zsmalloc is allocating memory from the DMA zone
> > initially, even though it has access to highmem?
>
> There was a lot of user memory in the DMA zone. So at some point the
> highmem zone was full and allocation fallback happened.
>
> The problem with zone fallback is that recent allocations go into
> lower zones, meaning they are further back on the LRU list. This
> applies to both user memory and zsmalloc memory -- the latter has a
> writeback LRU. On top of this, neither the zswap shrinker nor the
> zsmalloc shrinker (compaction) is zone aware. So page reclaim might
> have trouble hitting the right target zone.

I see what you mean. In this case, yeah, I think the internal
fragmentation in the zsmalloc pools may be the reason behind the
problem.

How many CPUs does this machine have? I am wondering if 32 is overkill
for small machines; perhaps the number of pools should be
min(nr_cpus, 32)?
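Roughly something like the sketch below -- completely untested, and the
helper name is made up, it is only meant to show the shape of what I
mean. num_possible_cpus() rather than num_online_cpus() so the result
does not depend on when CPUs come up:

#include <linux/cpumask.h>	/* num_possible_cpus() */
#include <linux/minmax.h>	/* min_t() */

/*
 * Hypothetical helper: pick the number of zsmalloc zpools per zswap
 * pool at pool-creation time instead of hard-coding 32, so a small
 * machine (e.g. 2 CPUs) only pays for 2 zpools worth of fragmentation.
 */
static unsigned int zswap_nr_zpools(void)
{
	return min_t(unsigned int, num_possible_cpus(), 32);
}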
Alternatively, the number of pools should scale with the memory size in
some way, such that we only increase fragmentation when it's tolerable
(a rough sketch of that variant is below, after the quoted text).

> We can't really tell how zspages are distributed across zones, but the
> overall number might be helpful. It'd be great if someone could make
> nr_zspages per zone :)
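For the memory-size alternative above, an equally rough sketch -- the
one-zpool-per-GiB ratio and the name are placeholders, not a concrete
proposal:

#include <linux/mm.h>		/* totalram_pages(), PAGE_SHIFT */
#include <linux/minmax.h>	/* clamp_t() */

/*
 * Hypothetical helper: scale the zpool count with the amount of RAM
 * (here one zpool per GiB, clamped to [1, 32]), so we only add
 * fragmentation where there is memory to spare.
 */
static unsigned int zswap_nr_zpools_by_ram(void)
{
	unsigned long gib = totalram_pages() >> (30 - PAGE_SHIFT);

	return clamp_t(unsigned int, gib, 1, 32);
}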