Hi all,

We are seeing machine lockups due to extreme memory pressure, where the free pages on all the zones are way below the min watermarks. The stack of the stuck CPU looks like the following (I had to crash the machine to get the info).

 #0 [ ] crash_nmi_callback
 #1 [ ] nmi_handle
 #2 [ ] default_do_nmi
 #3 [ ] do_nmi
 #4 [ ] end_repeat_nmi
--- <NMI exception stack> ---
 #5 [ ] queued_spin_lock_slowpath
 #6 [ ] _raw_spin_lock
 #7 [ ] ____cache_alloc_node
 #8 [ ] fallback_alloc
 #9 [ ] __kmalloc_node_track_caller
#10 [ ] __alloc_skb
#11 [ ] tcp_send_ack
#12 [ ] tcp_delack_timer
#13 [ ] run_timer_softirq
#14 [ ] irq_exit
#15 [ ] smp_apic_timer_interrupt
#16 [ ] apic_timer_interrupt
--- <IRQ stack> ---
#17 [ ] apic_timer_interrupt
#18 [ ] _raw_spin_lock
#19 [ ] vmpressure
#20 [ ] shrink_node
#21 [ ] do_try_to_free_pages
#22 [ ] try_to_free_pages
#23 [ ] __alloc_pages_direct_reclaim
#24 [ ] __alloc_pages_nodemask
#25 [ ] cache_grow_begin
#26 [ ] fallback_alloc
#27 [ ] __kmalloc_node_track_caller
#28 [ ] __alloc_skb
#29 [ ] tcp_sendmsg_locked
#30 [ ] tcp_sendmsg
#31 [ ] inet6_sendmsg
#32 [ ] ___sys_sendmsg
#33 [ ] sys_sendmsg
#34 [ ] do_syscall_64

These are high-traffic machines. Almost all the CPUs are stuck on the root memcg's vmpressure sr_lock, and almost half of the CPUs are stuck on the kmem cache node's list_lock in IRQ context. Note that the vmpressure sr_lock is irq-unsafe.

A couple of months back, we observed a similar situation with the swap locks, which forced us to disable swap on global pressure. Since we do proactive reclaim, disabling swap on global reclaim was not an issue. However, we have now started seeing the same situation with other irq-unsafe locks like the vmpressure sr_lock, and almost all the slab shrinkers have irq-unsafe spinlocks.

One way to mitigate this is to convert all such locks (which can be taken in the reclaim path) to be irq-safe (a sketch of the pattern is in the P.S. below), but it does not seem like a maintainable solution.

Please note that we are running a user-space oom-killer which is more aggressive than oomd/PSI, but even that got stuck under this much memory pressure.

I am wondering if anyone else has seen a similar situation in production and if there is a recommended way to resolve it.

thanks,
Shakeel
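
P.S. To make the mitigation concrete: by "converting such locks to be irq-safe" I mean switching the plain spin_lock()/spin_unlock() pairs on locks reachable from the reclaim path to the irqsave variants. The snippet below is only a schematic sketch of the pattern; the struct and function names are stand-ins for illustration, not the actual vmpressure code or a proposed patch.

	#include <linux/spinlock.h>

	/* stand-in for a structure with an irq-unsafe lock that is
	 * taken in the reclaim path, e.g. scanned/reclaimed counters */
	struct pressure_counters {
		spinlock_t sr_lock;
		unsigned long scanned;
		unsigned long reclaimed;
	};

	/* today: irq-unsafe form; an interrupt can arrive while
	 * sr_lock is held and go off allocating from irq context,
	 * stretching the hold time under heavy pressure */
	static void account_pressure(struct pressure_counters *p,
				     unsigned long scanned,
				     unsigned long reclaimed)
	{
		spin_lock(&p->sr_lock);
		p->scanned += scanned;
		p->reclaimed += reclaimed;
		spin_unlock(&p->sr_lock);
	}

	/* irq-safe conversion: local interrupts stay disabled for
	 * the duration of the critical section, so the holder cannot
	 * be interrupted mid-section */
	static void account_pressure_irqsafe(struct pressure_counters *p,
					     unsigned long scanned,
					     unsigned long reclaimed)
	{
		unsigned long flags;

		spin_lock_irqsave(&p->sr_lock, flags);
		p->scanned += scanned;
		p->reclaimed += reclaimed;
		spin_unlock_irqrestore(&p->sr_lock, flags);
	}

The downside is that every such lock in every shrinker would need the same treatment, which is why this does not look maintainable to me.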