Just got this bug report: writes were wildly backed up in bcachefs. Did some digging, and it looks like zswap is to blame:

[10264.128242] sysrq: Show Blocked State
[10264.128268] task:kworker/20:0H state:D stack:0 pid:143 tgid:143 ppid:2 flags:0x00004000
[10264.128271] Workqueue: bcachefs_io btree_write_submit [bcachefs]
[10264.128295] Call Trace:
[10264.128295]  <TASK>
[10264.128297]  __schedule+0x3e6/0x1520
[10264.128301]  ? ttwu_do_activate+0x64/0x200
[10264.128303]  schedule+0x32/0xd0
[10264.128304]  schedule_timeout+0x98/0x160
[10264.128306]  ? __pfx_process_timeout+0x10/0x10
[10264.128308]  io_schedule_timeout+0x50/0x80
[10264.128309]  wait_for_completion_io_timeout+0x7f/0x180
[10264.128310]  submit_bio_wait+0x78/0xb0
[10264.128313]  swap_writepage_bdev_sync+0xf6/0x150
[10264.128315]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[10264.128317]  zswap_writeback_entry+0xf2/0x180
[10264.128319]  shrink_memcg_cb+0xe7/0x2f0
[10264.128320]  ? xa_load+0x8c/0xe0
[10264.128321]  ? __pfx_shrink_memcg_cb+0x10/0x10
[10264.128322]  __list_lru_walk_one+0xb9/0x1d0
[10264.128324]  ? __pfx_shrink_memcg_cb+0x10/0x10
[10264.128325]  list_lru_walk_one+0x5d/0x90
[10264.128326]  zswap_shrinker_scan+0xc4/0x130
[10264.128327]  do_shrink_slab+0x13f/0x360
[10264.128328]  shrink_slab+0x28e/0x3c0
[10264.128329]  shrink_one+0x123/0x1b0
[10264.128331]  shrink_node+0x97e/0xbc0
[10264.128332]  do_try_to_free_pages+0xe7/0x5b0
[10264.128333]  try_to_free_pages+0xe1/0x200
[10264.128334]  __alloc_pages_slowpath.constprop.0+0x343/0xde0
[10264.128337]  __alloc_pages+0x32d/0x350
[10264.128338]  allocate_slab+0x400/0x460
[10264.128339]  ___slab_alloc+0x40d/0xa40
[10264.128341]  ? mempool_alloc+0x86/0x1b0
[10264.128343]  ? finish_task_switch.isra.0+0x94/0x2f0
[10264.128345]  ? __schedule+0x3ee/0x1520
[10264.128345]  kmem_cache_alloc+0x2e7/0x330
[10264.128347]  ? mempool_alloc+0x86/0x1b0
[10264.128348]  mempool_alloc+0x86/0x1b0
[10264.128349]  bio_alloc_bioset+0x200/0x4f0
[10264.128351]  ? __queue_work.part.0+0x1a5/0x390
[10264.128352]  bio_alloc_clone+0x23/0x60
[10264.128354]  alloc_io+0x26/0xf0 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128361]  dm_submit_bio+0xb8/0x580 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128366]  __submit_bio+0xb0/0x170
[10264.128367]  submit_bio_noacct_nocheck+0x159/0x370
[10264.128368]  bch2_submit_wbio_replicas+0x21c/0x3a0 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128391]  btree_write_submit+0x1cf/0x220 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128406]  process_one_work+0x178/0x350
[10264.128408]  worker_thread+0x30f/0x450
[10264.128409]  ? __pfx_worker_thread+0x10/0x10
[10264.128409]  kthread+0xe5/0x120

dm is using GFP_NOIO for that allocation, so zswap is clearly busted. We're already under generic_make_request() (submit_bio_noacct_nocheck() in the trace), so the submit_bio_wait() that zswap kicked off is never going to return: bios submitted recursively just get queued on current->bio_list, and that list isn't processed until the outer submission unwinds. We need to think about how to add some assertions so that we know reclaim context is being honoured...
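For the assertion side, something along these lines at the point of synchronous submission would turn this hang into a loud warning the first time the rule gets violated. Note that PF_MEMALLOC_NOIO only covers memalloc_noio_save() scopes, not an explicit GFP_NOIO allocation like dm's, so the reclaim gfp mask would still have to be plumbed down to the caller; the helper name and its placement are mine, just a sketch:

	#include <linux/bug.h>		/* WARN_ON_ONCE */
	#include <linux/gfp.h>		/* gfp_t, __GFP_IO */
	#include <linux/sched.h>	/* current, PF_MEMALLOC_NOIO */

	/* Hypothetical: call before any submit_bio_wait() issued from reclaim. */
	static inline void assert_reclaim_may_wait_on_io(gfp_t reclaim_gfp)
	{
		/* Scoped NOIO (memalloc_noio_save()) marks the task itself. */
		WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOIO);

		/*
		 * An explicit GFP_NOIO allocation only shows up in the mask
		 * reclaim was entered with, so that has to be passed in.
		 */
		WARN_ON_ONCE(!(reclaim_gfp & __GFP_IO));
	}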
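And on the zswap side, the obvious guard is for the shrinker to refuse writeback when reclaim was entered without __GFP_IO. Reclaim already hands every shrinker the originating allocation's flags via shrink_control, so a minimal sketch (the early return is my assumption about where the check belongs, not what zswap currently does) would be:

	#include <linux/gfp.h>		/* __GFP_IO */
	#include <linux/shrinker.h>	/* struct shrink_control, SHRINK_STOP */

	static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
						 struct shrink_control *sc)
	{
		/*
		 * sc->gfp_mask carries the flags of the allocation that
		 * entered reclaim, so dm's GFP_NOIO mempool_alloc() above
		 * shows up here with __GFP_IO clear: no submit_bio_wait()
		 * allowed from this context.
		 */
		if (!(sc->gfp_mask & __GFP_IO))
			return SHRINK_STOP;

		/* ... existing list_lru walk -> zswap_writeback_entry() ... */
		return 0;
	}

Returning SHRINK_STOP here mirrors what super_cache_scan() does when reclaim arrives without __GFP_FS: it tells vmscan to leave this shrinker alone for the rest of the pass instead of retrying it.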