On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao <bharata@xxxxxxx> wrote:
>
> On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> > On 09-Jul-24 11:28 AM, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@xxxxxxx> wrote:
> >>>
> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@xxxxxxx> wrote:
> >>>>>
> >>>>> Hi Yu Zhao,
> >>>>>
> >>>>> Thanks for your patches. See below...
> >>>>>
> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>>>>> Hi Bharata,
> >>>>>>
> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@xxxxxxx> wrote:
> >>>>>>>
> >>>>> <snip>
> >>>>>>>
> >>>>>>> Some experiments tried
> >>>>>>> ======================
> >>>>>>> 1) When MGLRU was enabled, many soft lockups were observed; no hard
> >>>>>>> lockups were seen during a 48-hour run. Below is one such soft
> >>>>>>> lockup.
> >>>>>>
> >>>>>> This is not really an MGLRU issue -- can you please try one of the
> >>>>>> attached patches? It (truncate.patch) should help with or without
> >>>>>> MGLRU.
> >>>>>
> >>>>> With truncate.patch and the default LRU scheme, a few hard lockups
> >>>>> are seen.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> In your original report, you said:
> >>>>
> >>>>   Most of the time the two contended locks are the lruvec and
> >>>>   inode->i_lock spinlocks.
> >>>>   ...
> >>>>   Oftentimes, the perf output at the time of the problem shows
> >>>>   heavy contention on the lruvec spin lock. Similar contention is
> >>>>   also observed with the inode i_lock (in the clear_shadow_entry path).
> >>>>
> >>>> Based on this new report, does it mean the i_lock is not as contended
> >>>> for the same path (truncation) you tested? If so, I'll post
> >>>> truncate.patch with Reported-by and Tested-by tags for you, unless
> >>>> you have objections.
> >>>
> >>> truncate.patch has been tested on two systems with the default LRU
> >>> scheme and the lockup due to inode->i_lock hasn't been seen yet after
> >>> a 24-hour run.
> >>
> >> Thanks.
> >>
> >>>>
> >>>> The two paths below were contended on the LRU lock, but they already
> >>>> batch their operations. So I don't know what else we can do
> >>>> surgically to improve them.
> >>>
> >>> What has been seen with this workload is that the lruvec spinlock is
> >>> held for a long time from the shrink_[active/inactive]_list path. In
> >>> this path, there is a case in isolate_lru_folios() where scanning of
> >>> LRU lists can become unbounded. To isolate a page from ZONE_DMA,
> >>> scanning/skipping of more than 150 million folios was sometimes seen.
> >>> There is already a comment in there which explains why nr_skipped
> >>> shouldn't be counted, but is there any possibility of re-looking at
> >>> this condition?
> >>
> >> For this specific case, probably this can help:
> >>
> >> @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> >>  		if (folio_zonenum(folio) > sc->reclaim_idx ||
> >>  		    skip_cma(folio, sc)) {
> >>  			nr_skipped[folio_zonenum(folio)] += nr_pages;
> >> -			move_to = &folios_skipped;
> >> -			goto move;
> >> +			list_move(&folio->lru, &folios_skipped);
> >> +			if (spin_is_contended(&lruvec->lru_lock)) {
> >> +				if (!list_empty(dst))
> >> +					break;
> >> +				spin_unlock_irq(&lruvec->lru_lock);
> >> +				cond_resched();
> >> +				spin_lock_irq(&lruvec->lru_lock);
> >> +			}
> >> +			continue;
> >>  		}
> >
> > Thanks, this helped. With this fix, the test ran for 24 hours without
> > any lockups attributable to the lruvec spinlock.
> > As noted in this thread, earlier isolate_lru_folios() used to scan
> > millions of folios and spend a lot of time with the spinlock held, but
> > after this fix such a scenario is no longer seen.
>
> However, during the weekend MGLRU-enabled run (with the above fix to
> isolate_lru_folios(), the previous two patches truncate.patch and
> mglru.patch, and the inode fix provided by Mateusz), another hard
> lockup related to the lruvec spinlock was observed.

Thanks again for the stress tests. I can't come up with any reasonable
band-aid at this moment, i.e., something not too ugly to work around a
more fundamental scalability problem.

Before I give up: what type of dirty data was written back to the NVMe
device? Was it page cache or swap?

> Here is the hard lockup:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 466
> CPU: 466 PID: 3103929 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> Call Trace:
>  <NMI>
>  ? show_regs+0x69/0x80
>  ? watchdog_hardlockup_check+0x1b4/0x3a0
> <SNIP>
>  ? native_queued_spin_lock_slowpath+0x2b4/0x300
>  </NMI>
>  <IRQ>
>  _raw_spin_lock_irqsave+0x5b/0x70
>  folio_lruvec_lock_irqsave+0x62/0x90
>  folio_batch_move_lru+0x9d/0x160
>  folio_rotate_reclaimable+0xab/0xf0
>  folio_end_writeback+0x60/0x90
>  end_buffer_async_write+0xaa/0xe0
>  end_bio_bh_io_sync+0x2c/0x50
>  bio_endio+0x108/0x180
>  blk_mq_end_request_batch+0x11f/0x5e0
>  nvme_pci_complete_batch+0xb5/0xd0 [nvme]
>  nvme_irq+0x92/0xe0 [nvme]
>  __handle_irq_event_percpu+0x6e/0x1e0
>  handle_irq_event+0x39/0x80
>  handle_edge_irq+0x8c/0x240
>  __common_interrupt+0x4e/0xf0
>  common_interrupt+0x49/0xc0
>  asm_common_interrupt+0x27/0x40
>
> Here are the lock holder details captured by all-cpu-backtrace:
>
> NMI backtrace for cpu 75
> CPU: 75 PID: 3095650 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:folio_inc_gen+0x142/0x430
> Call Trace:
>  <NMI>
>  ? show_regs+0x69/0x80
>  ? nmi_cpu_backtrace+0xc5/0x130
>  ? nmi_cpu_backtrace_handler+0x11/0x20
>  ? nmi_handle+0x64/0x180
>  ? default_do_nmi+0x45/0x130
>  ? exc_nmi+0x128/0x1a0
>  ? end_repeat_nmi+0xf/0x53
>  ? folio_inc_gen+0x142/0x430
>  ? folio_inc_gen+0x142/0x430
>  ? folio_inc_gen+0x142/0x430
>  </NMI>
>  <TASK>
>  isolate_folios+0x954/0x1630
>  evict_folios+0xa5/0x8c0
>  try_to_shrink_lruvec+0x1be/0x320
>  shrink_one+0x10f/0x1d0
>  shrink_node+0xa4c/0xc90
>  do_try_to_free_pages+0xc0/0x590
>  try_to_free_pages+0xde/0x210
>  __alloc_pages_noprof+0x6ae/0x12c0
>  alloc_pages_mpol_noprof+0xd9/0x220
>  folio_alloc_noprof+0x63/0xe0
>  filemap_alloc_folio_noprof+0xf4/0x100
>  page_cache_ra_unbounded+0xb9/0x1a0
>  page_cache_ra_order+0x26e/0x310
>  ondemand_readahead+0x1a3/0x360
>  page_cache_sync_ra+0x83/0x90
>  filemap_get_pages+0xf0/0x6a0
>  filemap_read+0xe7/0x3d0
>  blkdev_read_iter+0x6f/0x140
>  vfs_read+0x25b/0x340
>  ksys_read+0x67/0xf0
>  __x64_sys_read+0x19/0x20
>  x64_sys_call+0x1771/0x20d0
>  do_syscall_64+0x7e/0x130
>
> Regards,
> Bharata.
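
As an aside for anyone following the thread: the essence of the
isolate_lru_folios() change quoted above is bounding how long one
scanner holds the lruvec lock -- when the lock becomes contended, stop
(or drop the lock, cond_resched() and retake it) instead of skipping
millions of folios in a single critical section. Below is a rough
userspace analogue of that idea, purely for illustration; it is not
kernel code, and the names (scanner, short_user, BATCH, NR_ITEMS) are
made up. One thread scans a large number of items under a mutex but
releases it every BATCH items; the other thread repeatedly takes the
same mutex for a tiny operation, standing in for the IRQ-time
folio_rotate_reclaimable() path seen in the trace above.

/*
 * Illustrative userspace sketch only (assumed names, not kernel code).
 * Build with: gcc -O2 -pthread sketch.c
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_ITEMS (10L * 1000 * 1000)
#define BATCH    1024   /* max items scanned per lock hold */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int scan_done;

/* Long scan, analogous to skipping millions of folios in reclaim. */
static void *scanner(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        for (long i = 1; i <= NR_ITEMS; i++) {
                /* ... examine item i under the lock ... */
                if (i % BATCH == 0) {
                        /* Bound the hold time: let waiters in, then resume. */
                        pthread_mutex_unlock(&lock);
                        sched_yield();
                        pthread_mutex_lock(&lock);
                }
        }
        pthread_mutex_unlock(&lock);
        atomic_store(&scan_done, 1);
        return NULL;
}

/* Short critical sections, like the IRQ-time LRU updates in the traces. */
static void *short_user(void *arg)
{
        long acquisitions = 0;

        (void)arg;
        while (!atomic_load(&scan_done)) {
                pthread_mutex_lock(&lock);
                acquisitions++;         /* stand-in for a tiny LRU update */
                pthread_mutex_unlock(&lock);
        }
        printf("short user acquired the lock %ld times\n", acquisitions);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, scanner, NULL);
        pthread_create(&b, NULL, short_user, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

With BATCH at a few thousand, the short user keeps making progress for
the whole duration of the scan; setting BATCH as large as NR_ITEMS
starves it until the scan finishes, which is the userspace equivalent
of the lockups reported above.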