On Thu, Oct 20, 2022 at 09:46:16AM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > >
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu: 	Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996] 	(detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090]  <TASK>
> > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613]  </TASK>
> > > >
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > >
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been. I suspect
> > > it's not teardown of a large folio, but splitting. Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > >
> > > We must be observing a frozen folio. There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > >
> > > 	local_irq_disable();
> > > 	if (mapping) {
> > > 		xas_lock(&xas);
> > > 		(...)
> > > 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> >
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
> >
> > > So one way to solve this might be to try to take the xa_lock on
> > > failure to get the refcount. Otherwise a high-priority task
> > > might spin forever without a low-priority task getting the chance
> > > to finish the work being done while the folio is frozen.
> >
> > IIUC, then you are saying that there is a scheduling priority
> > inversion because the lookup failure looping path doesn't yield the
> > CPU?
> >
> > If so, how does taking the mapping tree spin lock on failure cause
> > the looping task to yield the CPU and hence allow the folio split to
> > make progress?
> >
> > Also, AFAICT, the page split has disabled local interrupts, so it
> > should effectively be running with preemption disabled as it has
> > turned off the mechanism the scheduler can use to preempt it. The
> > page split can't sleep, either, because it holds the mapping tree
> > lock. Hence I can't see how a split-in-progress can be preempted in
> > the first place to cause a priority inversion livelock like this...
> >
> > > ie this:
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..ca0eed80580f 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
> > >  	if (!folio || xa_is_value(folio))
> > >  		goto out;
> > >
> > > -	if (!folio_try_get_rcu(folio))
> > > +	if (!folio_try_get_rcu(folio)) {
> > > +		unsigned long flags;
> > > +
> > > +		xas_lock_irqsave(&xas, flags);
> > > +		xas_unlock_irqrestore(&xas, flags);
> > >  		goto repeat;
> > > +	}
>
> As I suspected, this change did not prevent the livelock. It
> reproduced after just 650 test iterations (a few minutes) with this
> change in place.
>
> > I would have thought:
> >
> > 	if (!folio_try_get_rcu(folio)) {
> > 		rcu_read_unlock();
> > 		cond_resched();
> > 		rcu_read_lock();
> > 		goto repeat;
> > 	}
> >
> > would be the right way to yield the CPU to avoid priority
> > inversion related livelocks here...
>
> I'm now trying this just to provide a data point that actually
> yielding the CPU avoids the downstream effects of the livelock (i.e.

This causes a different problem to occur. First, a task hangs
evicting the page cache on unlink like so:

[ 1352.990246] INFO: task test_write:3572 blocked for more than 122 seconds.
[ 1352.991418]       Not tainted 6.1.0-rc1-dgc+ #1618
[ 1352.992378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1352.994702] task:test_write      state:D stack:13088 pid:3572  ppid:3567 flags:0x00004002
[ 1352.996257] Call Trace:
[ 1352.996905]  <TASK>
[ 1352.997452]  __schedule+0x2f9/0xa20
[ 1352.998380]  schedule+0x5a/0xc0
[ 1352.999011]  io_schedule+0x42/0x70
[ 1352.999565]  folio_wait_bit_common+0x137/0x370
[ 1353.000553]  ? dio_warn_stale_pagecache.part.0+0x50/0x50
[ 1353.001425]  truncate_inode_pages_range+0x41a/0x470
[ 1353.002443]  evict+0x1ad/0x1c0
[ 1353.003199]  do_unlinkat+0x1db/0x2e0
[ 1353.003907]  __x64_sys_unlinkat+0x33/0x60
[ 1353.004569]  do_syscall_64+0x35/0x80
[ 1353.005354]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1353.006315] RIP: 0033:0x7f03d6a42157
[ 1353.007088] RSP: 002b:00007ffde8d98388 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 1353.008367] RAX: ffffffffffffffda RBX: 0000000000000df4 RCX: 00007f03d6a42157
[ 1353.009748] RDX: 0000000000000000 RSI: 000055f8660415a0 RDI: 0000000000000007
[ 1353.011115] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007ffde8d96105
[ 1353.012422] R10: 0000000000000158 R11: 0000000000000206 R12: 0000000000000158
[ 1353.013556] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1353.014958]  </TASK>

And soon after:

[ 1607.746983] ------------[ cut here ]------------
[ 1607.748141] WARNING: CPU: 27 PID: 51385 at fs/iomap/buffered-io.c:80 iomap_page_release+0xaf/0xc0
[ 1607.751098] Modules linked in:
[ 1607.751619] CPU: 27 PID: 51385 Comm: test_write Not tainted 6.1.0-rc1-dgc+ #1618
[ 1607.753307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.755443] RIP: 0010:iomap_page_release+0xaf/0xc0
[ 1607.756496] Code: 48 c1 e8 02 83 e0 01 75 10 38 d0 75 21 5b 4c 89 e7 5d 41 5c e9 d2 be ee ff eb ee e8 bb e0 ec ff eb 88 48 8b 07 5b 5d 41 5c c3 <0f> 0b eb b1 0f 0b eb db 0f 0b eb b2 0f 1f 44 00 00 0f 1f 44 00 00
[ 1607.759643] RSP: 0018:ffffc900087cb5b0 EFLAGS: 00010206
[ 1607.760775] RAX: 0000000000200000 RBX: ffffea0004378000 RCX: 000000000000000c
[ 1607.762181] RDX: 0017ffffc0050081 RSI: ffff888145cb0178 RDI: ffffea0004378000
[ 1607.764408] RBP: 0000000000000200 R08: 0000000000200000 R09: ffffea0004378008
[ 1607.765560] R10: ffff888141988800 R11: 0000000000033900 R12: ffff88810200ee40
[ 1607.766988] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.768347] FS:  00007f03d5c00640(0000) GS:ffff88833bdc0000(0000) knlGS:0000000000000000
[ 1607.769756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.771421] CR2: 00007f03cae00000 CR3: 0000000246226006 CR4: 0000000000770ee0
[ 1607.773298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.775026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.777341] PKRU: 55555554
[ 1607.777995] Call Trace:
[ 1607.778637]  <TASK>
[ 1607.779372]  iomap_release_folio+0x4d/0xa0
[ 1607.780510]  shrink_folio_list+0x9dd/0xb10
[ 1607.781931]  shrink_lruvec+0x558/0xb80
[ 1607.782922]  shrink_node+0x2ce/0x6f0
[ 1607.783993]  do_try_to_free_pages+0xd5/0x570
[ 1607.785031]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.786728]  reclaim_high+0xa6/0xf0
[ 1607.787672]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.789179]  try_charge_memcg+0x697/0x7f0
[ 1607.790100]  charge_memcg+0x35/0xd0
[ 1607.790930]  __mem_cgroup_charge+0x29/0x80
[ 1607.792884]  __filemap_add_folio+0x21a/0x460
[ 1607.794342]  ? scan_shadow_nodes+0x30/0x30
[ 1607.796052]  filemap_add_folio+0x37/0xa0
[ 1607.797036]  __filemap_get_folio+0x221/0x360
[ 1607.798468]  ? xas_load+0x5/0xa0
[ 1607.799417]  iomap_write_begin+0x103/0x580
[ 1607.800908]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.801940]  ? iomap_write_end+0x101/0x250
[ 1607.803305]  iomap_file_buffered_write+0x17f/0x330
[ 1607.804700]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.806240]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.807743]  vfs_write+0x2ca/0x3d0
[ 1607.808622]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.810502]  do_syscall_64+0x35/0x80
[ 1607.811511]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1607.813116] RIP: 0033:0x7f03d6b40a87
....
[ 1607.925434] kernel BUG at mm/vmscan.c:1306!
[ 1607.926977] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1607.928427] CPU: 8 PID: 51385 Comm: test_write Tainted: G        W          6.1.0-rc1-dgc+ #1618
[ 1607.930742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.932716] RIP: 0010:__remove_mapping+0x229/0x240
[ 1607.933648] Code: a9 00 00 08 00 75 27 48 8b 3b 48 81 c7 88 00 00 00 e8 db 64 c1 00 31 c0 e9 d6 fe ff ff 48 8b 3b e8 7c 29 0d 00 e9 27 ff ff ff <0f> 0b 0f 0b 48 8b 45 00 f6 c4 04 74 d0 31 c0 e9 b5 fe ff ff 0f 1f
[ 1607.937079] RSP: 0018:ffffc900087cb5d8 EFLAGS: 00010246
[ 1607.938050] RAX: 0017ffffc0010004 RBX: ffffc900087cb6e8 RCX: ffff888244bc5000
[ 1607.939443] RDX: 0000000000000001 RSI: ffffea0004378000 RDI: ffff888145cb0318
[ 1607.940980] RBP: ffffc900087cb648 R08: ffff888145cb0318 R09: ffffffff813be400
[ 1607.942300] R10: ffff88810200ee40 R11: 0000000000033900 R12: ffffea0004378000
[ 1607.943666] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.945004] FS:  00007f03d5c00640(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
[ 1607.946492] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.947552] CR2: 00007efd1ae33da8 CR3: 0000000246226001 CR4: 0000000000770ee0
[ 1607.948866] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.950250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.951738] PKRU: 55555554
[ 1607.952289] Call Trace:
[ 1607.952775]  <TASK>
[ 1607.953190]  shrink_folio_list+0x5cd/0xb10
[ 1607.954140]  shrink_lruvec+0x558/0xb80
[ 1607.954850]  shrink_node+0x2ce/0x6f0
[ 1607.955536]  do_try_to_free_pages+0xd5/0x570
[ 1607.956397]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.957503]  reclaim_high+0xa6/0xf0
[ 1607.958205]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.959149]  try_charge_memcg+0x697/0x7f0
[ 1607.959932]  charge_memcg+0x35/0xd0
[ 1607.960549]  __mem_cgroup_charge+0x29/0x80
[ 1607.961257]  __filemap_add_folio+0x21a/0x460
[ 1607.961912]  ? scan_shadow_nodes+0x30/0x30
[ 1607.962705]  filemap_add_folio+0x37/0xa0
[ 1607.963354]  __filemap_get_folio+0x221/0x360
[ 1607.964076]  ? xas_load+0x5/0xa0
[ 1607.964761]  iomap_write_begin+0x103/0x580
[ 1607.965510]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.966349]  ? iomap_write_end+0x101/0x250
[ 1607.967185]  iomap_file_buffered_write+0x17f/0x330
[ 1607.968133]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.968948]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.969783]  vfs_write+0x2ca/0x3d0
[ 1607.970326]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.971073]  do_syscall_64+0x35/0x80
[ 1607.971732]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

which is:

1297  * Same as remove_mapping, but if the folio is removed from the mapping, it
1298  * gets returned with a refcount of 0.
1299  */
1300 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
1301                             bool reclaimed, struct mem_cgroup *target_memcg)
1302 {
1303         int refcount;
1304         void *shadow = NULL;
1305
1306         BUG_ON(!folio_test_locked(folio));
1307         BUG_ON(mapping != folio_mapping(folio));
1308

tripping over an unlocked folio in shrink_folio_list()....
And, finally, there's also a process spinning on a CPU in
truncate_inode_pages_range():

[ 3468.057406] CPU: 1 PID: 3579 Comm: test_write Tainted: G      D W          6.1.0-rc1-dgc+ #1618
[ 3468.057407] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 3468.057408] RIP: 0010:truncate_inode_pages_range+0x3c0/0x470
[ 3468.057411] Code: 8b 7c e5 00 41 f6 c6 01 0f 85 62 ff ff ff be a9 03 00 00 48 c7 c7 b8 62 7b 82 e8 0b 46 e9 ff 2e 2e 2e 31 c0 f0 49 0f ba 7
[ 3468.057412] RSP: 0018:ffffc90002cffd40 EFLAGS: 00000246
[ 3468.057414] RAX: 0000000000000000 RBX: ffffc90002cffdd8 RCX: ffff8882461b8000
[ 3468.057415] RDX: 0000000000000000 RSI: 00000000000003a9 RDI: ffffffff827b62b8
[ 3468.057415] RBP: ffff888246eee718 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 3468.057416] R10: 0000000000002838 R11: 0000000000002834 R12: 0000000000000004
[ 3468.057417] R13: ffffc90002cffd60 R14: ffffea0004378000 R15: 0000000000002810
[ 3468.057418] FS:  00007f03d6953740(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[ 3468.057420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3468.057421] CR2: 00007f03d0100000 CR3: 0000000245e2c006 CR4: 0000000000770ee0
[ 3468.057422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3468.057422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3468.057423] PKRU: 55555554
[ 3468.057424] Call Trace:
[ 3468.057424]  <TASK>
[ 3468.057426]  evict+0x1ad/0x1c0
[ 3468.057429]  do_unlinkat+0x1db/0x2e0
[ 3468.057432]  __x64_sys_unlinkat+0x33/0x60
[ 3468.057434]  do_syscall_64+0x35/0x80
[ 3468.057437]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

So that's not the problem, either...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx