On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e. this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090] <TASK>
> > > > > [175901.996594] ? __schedule+0x301/0xa30
> > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714] ? xas_start+0x53/0xc0
> > > > > [175902.001484] ? xas_load+0x24/0xa0
> > > > > [175902.002208] ? xas_load+0x5/0xa0
> > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823] ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693] ? __do_fault+0x31/0x1d0
> > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613] </TASK>
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been. I suspect
> > > > it's not teardown of a large folio, but splitting. Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio. There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > >                 (...)
> > > >                 if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio. That might prevent a thread on a different socket
> > from making forward progress.
> > In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory. I'm all out of
> > ideas at the moment. Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache. I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
>
> It's something else. I got gdb attached to qemu and single stepped
> the looping lookup. The context I caught this time is truncate after
> unlink:
>
> (gdb) bt
> #0 find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014
> #1 find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614,
>     fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
> #2 0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
>     at mm/truncate.c:364
> #3 0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
> #4 0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
> #5 0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
> #6 0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
> #7 0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362
> #8 __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355
> #9 __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355
> #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
> #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
> #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> #13 0x0000000000000000 in ?? ()
>
> The find_lock_entries() call is being asked to start at index
> 25089, and we are spinning on a folio we find because
> folio_try_get_rcu(folio) is failing - the folio ref count is zero.
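(Side note for anyone following along: folio_try_get_rcu() boils down
to an add-unless-zero on folio->_refcount, so it can never succeed
while a split holds the count frozen at 0. A minimal sketch,
paraphrased from include/linux/page_ref.h of roughly this vintage,
with the CONFIG_TINY_RCU and debug paths trimmed, so not a verbatim
copy:

	static inline bool folio_try_get_rcu(struct folio *folio)
	{
		/*
		 * folio_ref_add_unless() is an atomic add-unless on
		 * _refcount: it does nothing and returns false while
		 * _refcount == 0 -- exactly the state folio_ref_freeze()
		 * leaves the folio in for the duration of a split.
		 */
		return folio_ref_add_unless(folio, 1, 0);
	}

So the lookup has nothing to block on; it just retries until someone
unfreezes the refcount.)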
>
> The xas state on lookup is:
>
> (gdb) p *xas
> $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000',
>   xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data>}
>
> indicating that we are trying to look up index 25092 (3 pages
> further in than the start of the batch), and the folio that this
> keeps returning is this:
>
> (gdb) p *folio
> $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>   mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = {
>   counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = {
>   next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}},
>   mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50,
>   _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, {
>   compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590},
>   compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912,
>   _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>,
>   prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50,
>   _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}},
>   ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = {
>   next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = {
>   counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b',
>   _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0}
> (gdb)
>
> The folio has a NULL mapping, and an index of 18688, which means
> even if it was not a folio that has been invalidated or freed, the
> index is way outside the range we are looking for.
>
> If I step it round the lookup loop, xas does not change, and the
> same folio is returned every time through the loop. Perhaps
> the mapping tree itself might be corrupt???
>
> It's simple enough to stop the machine once it has become stuck to
> observe the iteration and dump structures, just tell me what you
> need to know from here...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

This bug has emerged again, and I would like to propose a reproduction
sequence for it. It has nothing to do with the scheduler (this could be
wrong, and sorry for wasting your time if so):

Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a batch of
   pages onto cc->migratepages (the folio's refcount is 1 + n: one from
   alloc_pages plus n from the page cache).
2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
   extra_pins) freezes folio->refcnt to 0.
3. alloc_contig_range->migrate_pages->xas_split splits the entry,
   storing the (still frozen) folio into every slot from slot[offset]
   to slot[offset + sibs].

Thread_truncate:
4. enters the livelock via the chain below (fleshed out in the sketch
   after the proposed patch):

	rcu_read_lock();
	find_get_entry
		folio = xas_find
		if (!folio_try_get_rcu)
			xas_reset;
	rcu_read_unlock();

4'. alloc_contig_range->migrate_pages->__split_huge_page would set the
    folio's refcnt back to 2 and break the livelock, but it is blocked
    by contention on lruvec->lock.

If the above call chain makes sense, could we solve this with the
modification below, which makes xas_split and __split_huge_page atomic
with respect to each other by taking lruvec->lock earlier than we do
now?

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
		unsigned int new_order)
{
	...
+	lruvec = folio_lruvec_lock(folio);
	if (mapping) {
		int nr = folio_nr_pages(folio);

		xas_split(&xas, folio, folio_order(folio));
		if (folio_test_pmd_mappable(folio) &&
		    new_order < HPAGE_PMD_ORDER) {
			if (folio_test_swapbacked(folio)) {
				__lruvec_stat_mod_folio(folio,
						NR_SHMEM_THPS, -nr);
			} else {
				__lruvec_stat_mod_folio(folio,
						NR_FILE_THPS, -nr);
				filemap_nr_thps_dec(mapping);
			}
		}
	}

	__split_huge_page(page, list, end, new_order);
+	folio_lruvec_unlock(folio);
	...
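For reference, here is the lookup side that spins in step 4 above: a
minimal sketch paraphrased from mm/filemap.c:find_get_entry() of this
era (the mark and xa_is_value() handling are trimmed, so this is not a
verbatim copy):

	static void *find_get_entry(struct xa_state *xas, pgoff_t max,
			xa_mark_t mark)
	{
		struct folio *folio;

	retry:
		folio = xas_find(xas, max);
		if (xas_retry(xas, folio))
			goto retry;
		if (!folio)
			return NULL;

		/* Fails while the split holds _refcount frozen at 0. */
		if (!folio_try_get_rcu(folio))
			goto reset;

		/* Raced with removal: drop the ref and walk again. */
		if (unlikely(folio != xas_reload(xas))) {
			folio_put(folio);
			goto reset;
		}
		return folio;
	reset:
		xas_reset(xas);
		goto retry;
	}

While the refcount stays frozen between step 3's xas_split and step
4''s __split_huge_page, every trip through this loop ends in
xas_reset() and the next walk finds the same frozen folio: there is no
lock to queue on and nothing to sleep for, so the loop only terminates
once the unfreeze gets past the lruvec->lock contention.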