Re: order-0 page alloc failures during interrupt context on v6.6.43

On 8/23/24 00:07, Matt Fleming wrote:
> (Adding Mel to Cc list)
> 
> On Thu, Aug 22, 2024 at 9:02 PM Matt Fleming <mfleming@xxxxxxxxxxxxxx> wrote:
>>
>> Hey there,
>>
>> I'm seeing page allocation failures across the Cloudflare fleet,
>> typically during the network RX path, when trying to allocate order-0
>> pages in interrupt context. The machines appear to be under memory
>> pressure because the code that gets interrupted is
>> shrink_folio_list(). Below is an example stacktrace.
>>
>> Does anyone have any pointers on how to dig into this some more? It
>> appears as though the machines are not able to reclaim memory fast
>> enough when under pressure. Happy to provide more metrics or stats on
>> request.
>>
>> Thanks,
>> Matt
>>
>> ----8<----
>>
>> kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
>> nodemask=(null),cpuset=/,mems_allowed=0-7
>> CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G           O
>>    6.6.43-CUSTOM #1
>> Hardware name: MACHINE
>> Call Trace:
>>  <IRQ>
>>  dump_stack_lvl+0x3c/0x50
>>  warn_alloc+0x13a/0x1c0
>>  __alloc_pages_slowpath.constprop.0+0xc9d/0xd10
>>  ? srso_alias_return_thunk+0x5/0xfbef5
>>  ? __alloc_pages_bulk+0x3a0/0x630
>>  __alloc_pages+0x327/0x340
>>  __napi_alloc_skb+0x16d/0x1f0
>>  bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en]
>>  bnxt_rx_pkt+0x201/0x15e0 [bnxt_en]
>>  ? skb_release_data+0x14f/0x1b0
>>  __bnxt_poll_work+0x156/0x2b0 [bnxt_en]
>>  bnxt_poll+0xd9/0x1c0 [bnxt_en]
>>  ? srso_alias_return_thunk+0x5/0xfbef5
>>  __napi_poll+0x2b/0x1b0
>>  bpf_trampoline_6442524138+0x7d/0x1000
>>  __napi_poll+0x5/0x1b0
>>  net_rx_action+0x342/0x740
>>  ? srso_alias_return_thunk+0x5/0xfbef5
>>  handle_softirqs+0xcf/0x2b0
>>  irq_exit_rcu+0x6c/0x90
>>  sysvec_apic_timer_interrupt+0x72/0x90
>>  </IRQ>
>>  <TASK>
>>  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> RIP: 0010:queued_spin_lock_slowpath+0x260/0x2b0
>> Code: 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 c0 30 03 00 48 03
>> 04 d5 a0 d7 10 9c 48 89 28 8b 45 08 85 c0 75 09 f3 90 8b 45 08 <85> c0
>> 74 f7 48 8b 55 00 48 85 d2 74 83 0f 0d 0a e9 7b ff ff ff 65
>> RSP: 0018:ffffc9000f9cb768 EFLAGS: 00000246
>> RAX: 0000000000000000 RBX: ffff88905a3a9880 RCX: 0000000000000001
>> RDX: 000000000000001b RSI: 0000000000700000 RDI: ffff88905a3a9880
>> RBP: ffff88902f5330c0 R08: ffffc9000f9cb750 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000603fce623320 R12: 00000000002c0000
>> R13: 0000000000000001 R14: 00000000002c0000 R15: ffff889062f84a00
>>  zs_malloc+0x9d/0x520 [zsmalloc]
>>  ? srso_alias_return_thunk+0x5/0xfbef5
>>  ? __zstd_compress+0x60/0xa0 [zstd]
>>  zram_submit_bio+0x8d1/0x9f0 [zram]
>>  ? srso_alias_return_thunk+0x5/0xfbef5
>>  __submit_bio+0xaa/0x160
>>  submit_bio_noacct_nocheck+0x145/0x380
>>  ? submit_bio_noacct+0x24/0x4c0
>>  submit_bio_wait+0x5b/0xc0
>>  swap_writepage_bdev_sync+0xf8/0x170
>>  ? __pfx_submit_bio_wait_endio+0x10/0x10
>>  swap_writepage+0x36/0x80
>>  pageout+0xc8/0x240
>>  shrink_folio_list+0x489/0xd60
>>  shrink_lruvec+0x5a8/0xc40
>>  shrink_node+0x2c5/0x7a0
>>  balance_pgdat+0x32d/0x740
>>  kswapd+0x205/0x400
>>  ? __pfx_autoremove_wake_function+0x10/0x10
>>  ? __pfx_kswapd+0x10/0x10
>>  kthread+0xe8/0x120
>>  ? __pfx_kthread+0x10/0x10
>>  ret_from_fork+0x34/0x50
>>  ? __pfx_kthread+0x10/0x10
>>  ret_from_fork_asm+0x1b/0x30
>>  </TASK>
>> Mem-Info:
>> active_anon:14289951 inactive_anon:25056935 isolated_anon:1577
>>  active_file:3254095 inactive_file:3963476 isolated_file:1
>>  unevictable:4 dirty:305545 writeback:132
>>  slab_reclaimable:2916775 slab_unreclaimable:1689088
>>  mapped:2592762 shmem:1980658 pagetables:530605
>>  sec_pagetables:0 bounce:0
>>  kernel_misc_reclaimable:0
>>  free:618653 free_pcp:129763 free_cma:0
>> Node 0 active_anon:6461468kB inactive_anon:11667080kB
>> active_file:1971908kB inactive_file:2302944kB unevictable:0kB
>> isolated(anon):960kB isolated(file):0kB mapped:1070000kB
>> dirty:110140kB writeback:64kB shmem:842272kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB
>> kernel_stack:37624kB pagetables:235212kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 1 active_anon:7027824kB inactive_anon:12544448kB
>> active_file:1695500kB inactive_file:2093056kB unevictable:0kB
>> isolated(anon):308kB isolated(file):0kB mapped:1694880kB
>> dirty:163436kB writeback:24kB shmem:1090692kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB
>> kernel_stack:31860kB pagetables:231608kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 2 active_anon:7168612kB inactive_anon:11850084kB
>> active_file:1669812kB inactive_file:1870596kB unevictable:0kB
>> isolated(anon):144kB isolated(file):0kB mapped:1420628kB
>> dirty:105912kB writeback:24kB shmem:1092068kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB
>> kernel_stack:40220kB pagetables:263428kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 3 active_anon:7160892kB inactive_anon:12851880kB
>> active_file:1453156kB inactive_file:1884092kB unevictable:0kB
>> isolated(anon):452kB isolated(file):0kB mapped:1199768kB
>> dirty:124548kB writeback:72kB shmem:965128kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:2048kB writeback_tmp:0kB
>> kernel_stack:27124kB pagetables:284676kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 4 active_anon:7505196kB inactive_anon:12764280kB
>> active_file:1466756kB inactive_file:1878740kB unevictable:16kB
>> isolated(anon):640kB isolated(file):0kB mapped:1170484kB
>> dirty:136668kB writeback:44kB shmem:986212kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:2048kB writeback_tmp:0kB
>> kernel_stack:32380kB pagetables:312216kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 5 active_anon:7169752kB inactive_anon:12867040kB
>> active_file:1769832kB inactive_file:1809448kB unevictable:0kB
>> isolated(anon):1008kB isolated(file):0kB mapped:1589272kB
>> dirty:128616kB writeback:112kB shmem:1108816kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB
>> kernel_stack:32784kB pagetables:278392kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 6 active_anon:7333288kB inactive_anon:12854340kB
>> active_file:1504536kB inactive_file:2096488kB unevictable:0kB
>> isolated(anon):1336kB isolated(file):4kB mapped:1117792kB
>> dirty:228512kB writeback:92kB shmem:958680kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB
>> kernel_stack:43852kB pagetables:254060kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 7 active_anon:7332772kB inactive_anon:12828588kB
>> active_file:1484880kB inactive_file:1918540kB unevictable:0kB
>> isolated(anon):1460kB isolated(file):0kB mapped:1108224kB
>> dirty:224348kB writeback:96kB shmem:878764kB shmem_thp:0kB
>> shmem_pmdmapped:0kB anon_thp:2048kB writeback_tmp:0kB
>> kernel_stack:35580kB pagetables:262828kB sec_pagetables:0kB
>> all_unreclaimable? no
>> Node 0 DMA free:11264kB boost:0kB min:48kB low:60kB high:72kB
>> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
>> active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
>> present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB
>> local_pcp:0kB free_cma:0kB
>> lowmem_reserve[]: 0 2095 31529 31529
>> Node 0 DMA32 free:118988kB boost:0kB min:6832kB low:8976kB
>> high:11120kB reserved_highatomic:0KB active_anon:445316kB
>> inactive_anon:780792kB active_file:122148kB inactive_file:151592kB
>> unevictable:0kB writepending:1464kB present:2735864kB
>> managed:2145496kB mlocked:0kB bounce:0kB free_pcp:20468kB
>> local_pcp:48kB free_cma:0kB
>> lowmem_reserve[]: 0 0 29434 29434
>> Node 0 Normal free:266252kB boost:0kB min:95988kB low:126128kB

We're nominally above the min watermark (free 266252kB > min 95988kB).

>> high:156268kB reserved_highatomic:305152KB active_anon:6016024kB

But when subtracting reserved_highatomic (305152kB) from free (266252kB),
we end up below zero (-38900kB)?

>> inactive_anon:10884436kB active_file:1849108kB inactive_file:2149856kB
>> unevictable:0kB writepending:108740kB present:30670848kB
>> managed:30141044kB mlocked:0kB bounce:0kB free_pcp:37432kB
>> local_pcp:84kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 1 Normal free:290496kB boost:0kB min:105164kB low:138184kB
>> high:171204kB reserved_highatomic:333824KB active_anon:7028084kB
>> inactive_anon:12543028kB active_file:1694884kB inactive_file:2092728kB
>> unevictable:0kB writepending:163200kB present:33552384kB
>> managed:33022704kB mlocked:0kB bounce:0kB free_pcp:53668kB
>> local_pcp:892kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 2 Normal free:295000kB boost:0kB min:105172kB low:138196kB
>> high:171220kB reserved_highatomic:333824KB active_anon:7168872kB
>> inactive_anon:11848752kB active_file:1668876kB inactive_file:1871016kB
>> unevictable:0kB writepending:106604kB present:33554432kB
>> managed:33024756kB mlocked:0kB bounce:0kB free_pcp:48468kB
>> local_pcp:752kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 3 Normal free:308228kB boost:0kB min:105012kB low:137984kB
>> high:170956kB reserved_highatomic:333824KB active_anon:7164068kB
>> inactive_anon:12847600kB active_file:1453016kB inactive_file:1885952kB
>> unevictable:0kB writepending:126480kB present:33553408kB
>> managed:32974232kB mlocked:0kB bounce:0kB free_pcp:64400kB
>> local_pcp:732kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 4 Normal free:271672kB boost:0kB min:105172kB low:138196kB
>> high:171220kB reserved_highatomic:333824KB active_anon:7505196kB
>> inactive_anon:12763688kB active_file:1465932kB inactive_file:1880212kB
>> unevictable:16kB writepending:137892kB present:33554432kB
>> managed:33024756kB mlocked:16kB bounce:0kB free_pcp:60204kB
>> local_pcp:632kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 5 Normal free:291824kB boost:0kB min:105168kB low:138188kB
>> high:171208kB reserved_highatomic:333824KB active_anon:7169428kB
>> inactive_anon:12866872kB active_file:1769184kB inactive_file:1811512kB
>> unevictable:0kB writepending:131024kB present:33553408kB
>> managed:33023728kB mlocked:0kB bounce:0kB free_pcp:78708kB
>> local_pcp:568kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 6 Normal free:310936kB boost:0kB min:105172kB low:138196kB
>> high:171220kB reserved_highatomic:333824KB active_anon:7333792kB
>> inactive_anon:12852816kB active_file:1503264kB inactive_file:2097500kB
>> unevictable:0kB writepending:229284kB present:33554432kB
>> managed:33024756kB mlocked:0kB bounce:0kB free_pcp:74936kB
>> local_pcp:796kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 7 Normal free:309668kB boost:0kB min:105112kB low:138116kB
>> high:171120kB reserved_highatomic:333824KB active_anon:7331892kB

Nodes 1-7 all have the same amount of reserved_highatomic (333824kB);
node 0 is slightly lower at 305152kB.
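
For anyone who wants to redo the arithmetic, here is a small sketch that
repeats the "free minus reserved_highatomic" subtraction for every Normal
zone. The numbers are copied by hand from the report; the dict layout and
the helper name headroom_kb() are just my own illustration, not anything
the kernel exposes.

```python
# Values in kB, copied from the "Node N Normal" lines of the report.
NORMAL_ZONES = {
    0: {"free": 266252, "min": 95988, "reserved_highatomic": 305152},
    1: {"free": 290496, "min": 105164, "reserved_highatomic": 333824},
    2: {"free": 295000, "min": 105172, "reserved_highatomic": 333824},
    3: {"free": 308228, "min": 105012, "reserved_highatomic": 333824},
    4: {"free": 271672, "min": 105172, "reserved_highatomic": 333824},
    5: {"free": 291824, "min": 105168, "reserved_highatomic": 333824},
    6: {"free": 310936, "min": 105172, "reserved_highatomic": 333824},
    7: {"free": 309668, "min": 105112, "reserved_highatomic": 333824},
}


def headroom_kb(zone):
    """Free memory left after discounting the highatomic reserve (hypothetical helper)."""
    return zone["free"] - zone["reserved_highatomic"]


if __name__ == "__main__":
    for node, zone in NORMAL_ZONES.items():
        above_min = zone["free"] > zone["min"]
        print(f"Node {node} Normal: free > min: {above_min}, "
              f"free - reserved_highatomic = {headroom_kb(zone)}kB")
```

Every Normal zone is above min on paper, and every one goes negative once
the reserve is discounted.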

>> inactive_anon:12827964kB active_file:1484024kB inactive_file:1920356kB
>> unevictable:0kB writepending:226576kB present:33541120kB
>> managed:33005940kB mlocked:0kB bounce:0kB free_pcp:80748kB
>> local_pcp:704kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
>> 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
>> Node 0 DMA32: 2225*4kB (UME) 338*8kB (UME) 178*16kB (UME) 459*32kB
>> (UME) 215*64kB (UME) 115*128kB (ME) 86*256kB (UME) 35*512kB (UME)
>> 4*1024kB (UM) 6*2048kB (M) 1*4096kB (U) = 118036kB
>> Node 0 Normal: 797*4kB (H) 871*8kB (H) 802*16kB (H) 804*32kB (H)
>> 601*64kB (H) 310*128kB (H) 164*256kB (H) 67*512kB (H) 25*1024kB (H)
>> 14*2048kB (H) 2*4096kB (H) = 265612kB
>> Node 1 Normal: 507*4kB (H) 680*8kB (H) 682*16kB (H) 699*32kB (H)
>> 589*64kB (H) 363*128kB (H) 211*256kB (H) 93*512kB (H) 37*1024kB (H)
>> 13*2048kB (H) 0*4096kB = 291052kB
>> Node 2 Normal: 598*4kB (H) 843*8kB (H) 740*16kB (H) 735*32kB (H)
>> 507*64kB (H) 298*128kB (H) 175*256kB (H) 102*512kB (H) 37*1024kB (H)
>> 21*2048kB (H) 1*4096kB (H) = 297104kB
>> Node 3 Normal: 440*4kB (H) 509*8kB (H) 493*16kB (H) 559*32kB (H)
>> 438*64kB (H) 304*128kB (H) 197*256kB (H) 126*512kB (H) 50*1024kB (H)
>> 21*2048kB (H) 0*4096kB = 307704kB
>> Node 4 Normal: 604*4kB (H) 716*8kB (H) 674*16kB (H) 819*32kB (H)
>> 544*64kB (H) 303*128kB (H) 182*256kB (H) 74*512kB (H) 24*1024kB (H)
>> 20*2048kB (H) 0*4096kB = 268752kB
>> Node 5 Normal: 809*4kB (H) 873*8kB (H) 775*16kB (H) 749*32kB (H)
>> 414*64kB (H) 254*128kB (H) 154*256kB (H) 90*512kB (H) 37*1024kB (H)
>> 31*2048kB (H) 0*4096kB = 292476kB
>> Node 6 Normal: 659*4kB (H) 689*8kB (H) 708*16kB (H) 851*32kB (H)
>> 592*64kB (H) 386*128kB (H) 226*256kB (H) 91*512kB (H) 40*1024kB (H)
>> 13*2048kB (H) 1*4096kB (H) = 310132kB
>> Node 7 Normal: 898*4kB (H) 907*8kB (H) 893*16kB (H) 897*32kB (H)
>> 597*64kB (H) 375*128kB (H) 203*256kB (H) 86*512kB (H) 29*1024kB (H)
>> 20*2048kB (H) 0*4096kB = 306704kB

And (H) everywhere confirms all the free memory in Normal zones is reserved
highatomic.
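
If you want to check this across many reports rather than by eye, a rough
sketch like the following can parse one free-list line. The regex and the
function name parse_free_list() are my own assumptions about the dump
format as it appears in this report. The letter meanings are the ones the
kernel's free-area dump uses for migratetypes.

```python
import re

# Migratetype letters as printed in the free-area dump.
MIGRATETYPE_LETTERS = {
    "U": "unmovable",
    "M": "movable",
    "E": "reclaimable",
    "H": "highatomic",
    "C": "CMA",
    "I": "isolate",
}

# Matches entries such as "797*4kB (H)" or "0*4096kB" (no letters).
ENTRY_RE = re.compile(r"(\d+)\*(\d+)kB(?:\s+\(([A-Z]+)\))?")

# Node 0 Normal line from the report, with the mail line wrapping undone.
NODE0_NORMAL = (
    "797*4kB (H) 871*8kB (H) 802*16kB (H) 804*32kB (H) "
    "601*64kB (H) 310*128kB (H) 164*256kB (H) 67*512kB (H) 25*1024kB (H) "
    "14*2048kB (H) 2*4096kB (H) = 265612kB"
)


def parse_free_list(text):
    """Return (total free kB, set of migratetype letters on non-empty orders)."""
    total_kb = 0
    letters = set()
    for count, size_kb, flags in ENTRY_RE.findall(text):
        total_kb += int(count) * int(size_kb)
        if int(count) > 0:
            letters.update(flags)
    return total_kb, letters


if __name__ == "__main__":
    total_kb, letters = parse_free_list(NODE0_NORMAL)
    names = sorted(MIGRATETYPE_LETTERS[c] for c in letters)
    print(f"total free: {total_kb}kB, migratetypes with free pages: {names}")
```

For the Node 0 Normal line the per-order entries add up to the 265612kB
printed at the end of that line, and the only letter seen is H.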

We have several paths where the highatomic reserve shrinks itself in
response to other allocations struggling, and I recall some recent-ish
fixes in this area. But at a glance, none of them looks both relevant here
and simply missing from 6.6 LTS. The "order:0 GFP_ATOMIC" case seems to
have no way to dip into the highatomic reserves, and perhaps it should?

AFAICS:

- __zone_watermark_unusable_free() does not subtract reserved_highatomic
for ALLOC_RESERVES (which includes ALLOC_NON_BLOCK, which GFP_ATOMIC
allocations have), so these allocations pass the watermark check
- but in rmqueue_buddy() only ALLOC_OOM is able to fall back to highatomic
- unreserve_highatomic_pageblock() is only called from reclaim, and there's
no reclaim for GFP_ATOMIC

(Also worth checking whether kswapd even does anything when free > high but
the free memory is all highatomic. Maybe not, in which case it can't help
us here.)
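
To make the mismatch in the AFAICS points concrete, here is a toy model
of it. This is not kernel code: the flag values are made up, the
per-flag lowering of the min watermark is left out, and watermark_ok()
and rmqueue() are simplified stand-ins for __zone_watermark_ok() and
rmqueue_buddy() as I read them in 6.6. Treat it as a sketch of the
reasoning, nothing more.

```python
# Toy model only: flag values are invented for illustration.
ALLOC_NON_BLOCK = 0x1
ALLOC_MIN_RESERVE = 0x2
ALLOC_OOM = 0x4
ALLOC_HIGHATOMIC = 0x8
ALLOC_RESERVES = ALLOC_NON_BLOCK | ALLOC_MIN_RESERVE | ALLOC_OOM


def watermark_ok(free_kb, min_kb, reserved_highatomic_kb, alloc_flags):
    """Simplified check: the reserve is only discounted for callers
    that have none of the ALLOC_RESERVES flags."""
    usable_kb = free_kb
    if not (alloc_flags & ALLOC_RESERVES):
        usable_kb -= reserved_highatomic_kb
    return usable_kb > min_kb


def rmqueue(free_kb_by_type, alloc_flags):
    """Simplified free-list pick: return the migratetype letter served,
    or None if the allocation fails."""
    has_highatomic = free_kb_by_type.get("H", 0) > 0
    if (alloc_flags & ALLOC_HIGHATOMIC) and has_highatomic:
        return "H"
    for letter in ("U", "M", "E"):
        if free_kb_by_type.get(letter, 0) > 0:
            return letter
    # Last resort, modelled as being limited to ALLOC_OOM callers.
    if (alloc_flags & ALLOC_OOM) and has_highatomic:
        return "H"
    return None


if __name__ == "__main__":
    # Node 0 Normal from the report: every free page is marked (H).
    free_kb, min_kb, reserved_kb = 266252, 95988, 305152
    free_lists = {"H": 265612}
    # Assumed flags for an order-0 GFP_ATOMIC allocation.
    atomic_order0 = ALLOC_NON_BLOCK | ALLOC_MIN_RESERVE

    print("watermark ok:", watermark_ok(free_kb, min_kb, reserved_kb, atomic_order0))
    print("page from:", rmqueue(free_lists, atomic_order0))
```

With the Node 0 Normal numbers the watermark check passes for the
order-0 atomic caller, yet the free-list pick returns nothing, which
matches the failure in the report.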

>> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>> 9214746 total pagecache pages
>> 17797 pages in swap cache
>> Free swap  = 208645424kB
>> Total swap = 263402492kB
>> 67071581 pages RAM
>> 0 pages HighMem/MovableOnly
>> 1220888 pages reserved
> 




