On Mon, 20 Feb 2023, Huang, Ying wrote: > Hi, Hugh, > > Hugh Dickins <hughd@xxxxxxxxxx> writes: > > > On Mon, 13 Feb 2023, Huang Ying wrote: > > > >> From: "Huang, Ying" <ying.huang@xxxxxxxxx> > >> > >> Now, migrate_pages() migrate folios one by one, like the fake code as > >> follows, > >> > >> for each folio > >> unmap > >> flush TLB > >> copy > >> restore map > >> > >> If multiple folios are passed to migrate_pages(), there are > >> opportunities to batch the TLB flushing and copying. That is, we can > >> change the code to something as follows, > >> > >> for each folio > >> unmap > >> for each folio > >> flush TLB > >> for each folio > >> copy > >> for each folio > >> restore map > >> > >> The total number of TLB flushing IPI can be reduced considerably. And > >> we may use some hardware accelerator such as DSA to accelerate the > >> folio copying. > >> > >> So in this patch, we refactor the migrate_pages() implementation and > >> implement the TLB flushing batching. Base on this, hardware > >> accelerated folio copying can be implemented. > >> > >> If too many folios are passed to migrate_pages(), in the naive batched > >> implementation, we may unmap too many folios at the same time. The > >> possibility for a task to wait for the migrated folios to be mapped > >> again increases. So the latency may be hurt. To deal with this > >> issue, the max number of folios be unmapped in batch is restricted to > >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence > >> is at the same level of THP migration. > >> > >> We use the following test to measure the performance impact of the > >> patchset, > >> > >> On a 2-socket Intel server, > >> > >> - Run pmbench memory accessing benchmark > >> > >> - Run `migratepages` to migrate pages of pmbench between node 0 and > >> node 1 back and forth. > >> > >> With the patch, the TLB flushing IPI reduces 99.1% during the test and > >> the number of pages migrated successfully per second increases 291.7%. > >> > >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, > >> 2 NUMA nodes. Test results show that the page migration performance > >> increases up to 78%. > >> > >> This patchset is based on mm-unstable 2023-02-10. > > > > And back in linux-next this week: I tried next-20230217 overnight. > > > > There is a deadlock in this patchset (and in previous versions: sorry > > it's taken me so long to report), but I think one that's easily solved. > > > > I've not bisected to precisely which patch (load can take several hours > > to hit the deadlock), but it doesn't really matter, and I expect that > > you can guess. > > > > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), > > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. > > So, plenty of ext4 page cache and buffer_heads. > > > > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), > > either in kcompactd0 or in khugepaged trying to compact, or in both: > > it ends up calling __lock_buffer(), and that schedules away, waiting > > forever to get BH_lock. I have not identified who is holding BH_lock, > > but I imagine a jbd2 journalling thread, and presume that it wants one > > of the folio locks which migrate_pages_batch() is already holding; or > > maybe it's all more convoluted than that. Other tasks then back up > > waiting on those folio locks held in the batch. > > > > Never a problem with buffer_migrate_folio(), always with the "more > > careful" buffer_migrate_folio_norefs(). And the patch below fixes > > it for me: I've had enough hours with it now, on enough occasions, > > to be confident of that. > > > > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > > very well, and I hope can assure us that there is an understandable > > deadlock here, from holding several random folio locks, then trying > > to lock buffers. Cc'ing fsdevel, because there's a risk that mm > > folk think something is safe, when it's not sufficient to cope with > > the diversity of filesystems. I hope nothing more than the below is > > needed (and I've had no other problems with the patchset: good job), > > but cannot be sure. > > > > [PATCH next] migrate_pages: fix deadlock on buffer heads > > > > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), > > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only > > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. > > > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx> > > > > --- next-20230217/mm/migrate.c > > +++ fixed/mm/migrate.c > > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct > > if (folio_ref_count(src) != expected_count) > > return -EAGAIN; > > > > - if (!buffer_migrate_lock_buffers(head, mode)) > > + if (!buffer_migrate_lock_buffers(head, > > + check_refs ? MIGRATE_ASYNC : mode)) > > return -EAGAIN; > > > > if (check_refs) { > > Thank you very much for pointing this out and the fix patch. Today, my > colleague Pengfei reported a deadlock bug to me. It seems that we > cannot wait the writeback to complete when we have locked some folios. > Below patch can fix that deadlock. I don't know whether this is related > to the deadlock you run into. It appears that we should avoid to > lock/wait synchronously if we have locked more than one folios. Thanks, I've checked now, on next-20230217 without my patch but with your patch below: it took a few hours, but still deadlocks as I described above, so it's not the same issue. Yes, that's a good principle, that we should avoid to lock/wait synchronously once we have locked one folio (hmm, above you say "more than one": I think we mean the same thing, we're just stating it differently, given how the code runs at present). I'm not a great fan of migrate_folio_unmap()'s arguments, "force" followed by "oh, but don't force" (but applaud the recent "avoid_force_lock" as much better than the original "force_lock"). I haven't tried, but I wonder if you can avoid both those arguments, and both of these patches, by passing down an adjusted mode (perhaps MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first folio of a batch has been acquired (then restore to the original mode when starting a new batch). (My patch is weak in that it trylocks for buffer_head even on the first folio of a MIGRATE_SYNC norefs batch, although that has never given a problem historically: adjusting the mode after acquiring the first folio would correct that weakness.) Hugh > > Best Regards, > Huang, Ying > > ------------------------------------8<------------------------------------ > From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001 > From: Huang Ying <ying.huang@xxxxxxxxx> > Date: Mon, 20 Feb 2023 14:56:34 +0800 > Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback > > Pengfei reported a system soft lockup issue with Syzkaller. The stack > traces are as follows, > > ... > [ 300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds. > [ 300.125214] Not tainted 6.2.0-rc4-kvm+ #1314 > [ 300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 300.125736] task:kworker/u4:3 state:D stack:0 pid:73 ppid:2 flags:0x00004000 > [ 300.126059] Workqueue: writeback wb_workfn (flush-7:3) > [ 300.126282] Call Trace: > [ 300.126378] <TASK> > [ 300.126464] __schedule+0x43b/0xd00 > [ 300.126601] ? __blk_flush_plug+0x142/0x180 > [ 300.126765] schedule+0x6a/0xf0 > [ 300.126912] io_schedule+0x4a/0x80 > [ 300.127051] folio_wait_bit_common+0x1b5/0x4e0 > [ 300.127227] ? __pfx_wake_page_function+0x10/0x10 > [ 300.127403] __folio_lock+0x27/0x40 > [ 300.127541] write_cache_pages+0x350/0x870 > [ 300.127699] ? __pfx_iomap_do_writepage+0x10/0x10 > [ 300.127889] iomap_writepages+0x3f/0x80 > [ 300.128037] xfs_vm_writepages+0x94/0xd0 > [ 300.128192] ? __pfx_xfs_vm_writepages+0x10/0x10 > [ 300.128370] do_writepages+0x10a/0x240 > [ 300.128514] ? lock_is_held_type+0xe6/0x140 > [ 300.128675] __writeback_single_inode+0x9f/0xa90 > [ 300.128854] writeback_sb_inodes+0x2fb/0x8d0 > [ 300.129030] __writeback_inodes_wb+0x68/0x150 > [ 300.129212] wb_writeback+0x49c/0x770 > [ 300.129357] wb_workfn+0x6fb/0x9d0 > [ 300.129500] process_one_work+0x3cc/0x8d0 > [ 300.129669] worker_thread+0x66/0x630 > [ 300.129824] ? __pfx_worker_thread+0x10/0x10 > [ 300.129989] kthread+0x153/0x190 > [ 300.130116] ? __pfx_kthread+0x10/0x10 > [ 300.130264] ret_from_fork+0x29/0x50 > [ 300.130409] </TASK> > [ 300.179347] INFO: task repro:1023 blocked for more than 147 seconds. > [ 300.179905] Not tainted 6.2.0-rc4-kvm+ #1314 > [ 300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 300.180955] task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 > [ 300.181660] Call Trace: > [ 300.181879] <TASK> > [ 300.182085] __schedule+0x43b/0xd00 > [ 300.182407] schedule+0x6a/0xf0 > [ 300.182694] io_schedule+0x4a/0x80 > [ 300.183020] folio_wait_bit_common+0x1b5/0x4e0 > [ 300.183506] ? compaction_alloc+0x77/0x1150 > [ 300.183892] ? __pfx_wake_page_function+0x10/0x10 > [ 300.184304] folio_wait_bit+0x30/0x40 > [ 300.184640] folio_wait_writeback+0x2e/0x1e0 > [ 300.185034] migrate_pages_batch+0x555/0x1ac0 > [ 300.185462] ? __pfx_compaction_alloc+0x10/0x10 > [ 300.185808] ? __pfx_compaction_free+0x10/0x10 > [ 300.186022] ? __this_cpu_preempt_check+0x17/0x20 > [ 300.186234] ? lock_is_held_type+0xe6/0x140 > [ 300.186423] migrate_pages+0x100e/0x1180 > [ 300.186603] ? __pfx_compaction_free+0x10/0x10 > [ 300.186800] ? __pfx_compaction_alloc+0x10/0x10 > [ 300.187011] compact_zone+0xe10/0x1b50 > [ 300.187182] ? lock_is_held_type+0xe6/0x140 > [ 300.187374] ? check_preemption_disabled+0x80/0xf0 > [ 300.187588] compact_node+0xa3/0x100 > [ 300.187755] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 > [ 300.187993] ? _find_first_bit+0x7b/0x90 > [ 300.188171] sysctl_compaction_handler+0x5d/0xb0 > [ 300.188376] proc_sys_call_handler+0x29d/0x420 > [ 300.188583] proc_sys_write+0x2b/0x40 > [ 300.188749] vfs_write+0x3a3/0x780 > [ 300.188912] ksys_write+0xb7/0x180 > [ 300.189070] __x64_sys_write+0x26/0x30 > [ 300.189260] do_syscall_64+0x3b/0x90 > [ 300.189424] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 300.189654] RIP: 0033:0x7f3a2471f59d > [ 300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 > [ 300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d > [ 300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 > [ 300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 > [ 300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 > [ 300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 > [ 300.191440] </TASK> > ... > > To migrate a folio, we may wait the writeback of a folio to complete > when we already have held the lock of some folios. But the writeback > code may wait to lock some folio we held lock. This causes the > deadlock. To fix the issue, we will avoid to wait the writeback to > complete if we have locked some folios. After moving the locked > folios and unlocked, we will retry. > > Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx> > Reported-by: "Xu, Pengfei" <pengfei.xu@xxxxxxxxx> > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: Stefan Roesch <shr@xxxxxxxxxxxx> > Cc: Tejun Heo <tj@xxxxxxxxxx> > Cc: Xin Hao <xhao@xxxxxxxxxxxxxxxxx> > Cc: Zi Yan <ziy@xxxxxxxxxx> > Cc: Yang Shi <shy828301@xxxxxxxxx> > Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > --- > mm/migrate.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 28b435cdeac8..bc9a8050f1b0 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page > } > if (!force) > goto out; > + /* > + * We have locked some folios and are going to wait the > + * writeback of this folio to complete. But it's possible for > + * the writeback to wait to lock the folios we have locked. To > + * avoid a potential deadlock, let's bail out and not do that. > + * The locked folios will be moved and unlocked, then we > + * can wait the writeback of this folio. > + */ > + if (avoid_force_lock) { > + rc = -EDEADLOCK; > + goto out; > + } > folio_wait_writeback(src); > } > > -- > 2.39.1 > >