On Thu, 22 Aug 2024, Matt Fleming wrote:

> I'm seeing page allocation failures across the Cloudflare fleet,
> typically during the network RX path, when trying to allocate order-0
> pages in interrupt context. The machines appear to be under memory
> pressure because the code that gets interrupted is
> shrink_folio_list(). Below is an example stacktrace.
>
> Does anyone have any pointers on how to dig into this some more? It
> appears as though the machines are not able to reclaim memory fast
> enough when under pressure. Happy to provide more metrics or stats on
> request.

Look at the full kernel log output until the time of the allocation
failure?

It looks like there is enough memory in every zone to satisfy the
allocation request.

Stacktrace looks like memory is pushed out via kswapd and zram to
somewhere and then we get interrupted by incoming network traffic.
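
As a rough way to check the per-zone picture on a live machine, the sketch below parses /proc/zoneinfo and lists zones whose free page count is at or below the min watermark. It is only an illustration and not part of the original report: the helper names (parse_zoneinfo, zones_below_min, read_proc_zoneinfo) are made up here, and the parser assumes the usual "pages free" / "min" / "low" / "high" layout, which may differ between kernel versions.

```python
def parse_zoneinfo(text):
    """Parse /proc/zoneinfo-style text.

    Returns {(node, zone_name): {"free": int, "min": int, "low": int, "high": int}}
    with counts in pages. Assumes the common layout; fields that are
    missing from the input are simply left out of the result.
    """
    zones = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Zone header, e.g. "Node 0, zone   Normal"
        if tokens[0] == "Node" and len(tokens) >= 4 and tokens[2] == "zone":
            node = int(tokens[1].rstrip(","))
            current = zones.setdefault((node, tokens[3]), {})
        elif current is None:
            continue
        # Free page count, e.g. "pages free     3840"
        elif tokens[:2] == ["pages", "free"] and len(tokens) == 3:
            current["free"] = int(tokens[2])
        # Watermarks, e.g. "min      8". The per-cpu pageset fields are
        # written with a trailing colon ("high:"), so they do not match.
        elif (len(tokens) == 2 and tokens[0] in ("min", "low", "high")
              and tokens[0] not in current):
            current[tokens[0]] = int(tokens[1])
    return zones


def zones_below_min(zones):
    """Return the (node, zone) keys whose free pages are <= the min watermark."""
    return [key for key, z in zones.items()
            if "free" in z and "min" in z and z["free"] <= z["min"]]


def read_proc_zoneinfo(path="/proc/zoneinfo"):
    """Read and parse the live zoneinfo file (Linux only)."""
    with open(path) as f:
        return parse_zoneinfo(f.read())
```

Calling zones_below_min(read_proc_zoneinfo()) gives a snapshot only. The state at the moment of the failure is better taken from the Mem-Info dump that accompanies the allocation failure warning in the kernel log, since a later snapshot may look healthy again.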