On 09/07/2024 06:58, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@xxxxxxx> wrote:
>>
>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>>>>
>>>> Hi Yu Zhao,
>>>>
>>>> Thanks for your patches. See below...
>>>>
>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>> Hi Bharata,
>>>>>
>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> Some experiments tried
>>>>>> ======================
>>>>>> 1) When MGLRU was enabled many soft lockups were observed; no hard
>>>>>> lockups were seen in a 48-hour run. Below is one such soft lockup.
>>>>>
>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>> attached patches? It (truncate.patch) should help with or without
>>>>> MGLRU.
>>>>
>>>> With truncate.patch and the default LRU scheme, a few hard lockups are seen.
>>>
>>> Thanks.
>>>
>>> In your original report, you said:
>>>
>>>    Most of the times the two contended locks are lruvec and
>>>    inode->i_lock spinlocks.
>>>    ...
>>>    Often times, the perf output at the time of the problem shows
>>>    heavy contention on lruvec spin lock. Similar contention is
>>>    also observed with inode i_lock (in clear_shadow_entry path)
>>>
>>> Based on this new report, does it mean the i_lock is not as contended,
>>> for the same path (truncation) you tested? If so, I'll post
>>> truncate.patch and add your Reported-by and Tested-by, unless you have
>>> objections.
>>
>> truncate.patch has been tested on two systems with the default LRU scheme,
>> and the lockup due to inode->i_lock hasn't been seen yet after a 24-hour run.
>
> Thanks.
>
>>>
>>> The two paths below were contended on the LRU lock, but they already
>>> batch their operations. So I don't know what else we can do surgically
>>> to improve them.
>>
>> What has been seen with this workload is that the lruvec spinlock is
>> held for a long time from the shrink_[active/inactive]_list path. In this
>> path, there is a case in isolate_lru_folios() where scanning of LRU
>> lists can become unbounded. To isolate a page from ZONE_DMA,
>> scanning/skipping of more than 150 million folios was sometimes seen.
>> There is already a comment in there which explains why nr_skipped
>> shouldn't be counted, but is there any possibility of revisiting this
>> condition?
>
> For this specific case, probably this can help:
>
> @@ -1659,8 +1659,15 @@ static unsigned long
> isolate_lru_folios(unsigned long nr_to_scan,
>                 if (folio_zonenum(folio) > sc->reclaim_idx ||
>                                 skip_cma(folio, sc)) {
>                         nr_skipped[folio_zonenum(folio)] += nr_pages;
> -                       move_to = &folios_skipped;
> -                       goto move;
> +                       list_move(&folio->lru, &folios_skipped);
> +                       if (spin_is_contended(&lruvec->lru_lock)) {
> +                               if (!list_empty(dst))
> +                                       break;
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                               cond_resched();
> +                               spin_lock_irq(&lruvec->lru_lock);
> +                       }
> +                       continue;
>                 }
>

Hi Yu,

We are seeing lockups and high memory pressure in Meta production due to
this lock contention as well. My colleague highlighted it in
https://lore.kernel.org/all/ZrssOrcJIDy8hacI@xxxxxxxxx/ and was pointed
to this fix.

We removed the skip_cma check as a temporary measure, but this is the
proper fix. I might have missed it, but I didn't see this posted as a
patch on the mailing list. Just wanted to check whether you were planning
to send it as a patch? Happy to send it on your behalf as well.

Thanks
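
P.S. For anyone else picking this up, below is a minimal standalone sketch
of the locking pattern the diff applies, not the actual isolate_lru_folios()
code: scan_items(), struct scan_item, keep_item() and max_zone are made-up
names standing in for the folio, reclaim_idx and skip_cma() details. The
point is only the shape of the skip path: park ineligible entries on a side
list, and when the lru-style lock is contended either return whatever has
already been isolated or drop the lock, cond_resched(), and reacquire it.

/* Illustrative sketch only; names are hypothetical, not mm/vmscan.c code. */
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct scan_item {
	struct list_head lru;
	int zone;
};

/* Stand-in for the reclaim_idx/skip_cma() eligibility test. */
static bool keep_item(struct scan_item *item, int max_zone)
{
	return item->zone <= max_zone;
}

static unsigned long scan_items(struct list_head *src, spinlock_t *lock,
				int max_zone, unsigned long nr_to_scan,
				struct list_head *dst)
{
	LIST_HEAD(skipped);
	unsigned long taken = 0;

	spin_lock_irq(lock);
	while (taken < nr_to_scan && !list_empty(src)) {
		struct scan_item *item =
			list_last_entry(src, struct scan_item, lru);

		if (!keep_item(item, max_zone)) {
			/* Park ineligible items on a private list so the
			 * scan keeps making progress from the tail of src. */
			list_move(&item->lru, &skipped);
			if (spin_is_contended(lock)) {
				/* Hand back what was already isolated rather
				 * than keep other CPUs spinning on the lock. */
				if (!list_empty(dst))
					break;
				/* Nothing isolated yet: drop the lock, let
				 * waiters in, then resume the scan. */
				spin_unlock_irq(lock);
				cond_resched();
				spin_lock_irq(lock);
			}
			continue;
		}
		list_move(&item->lru, dst);
		taken++;
	}
	/* Return the skipped items to the source list before unlocking. */
	list_splice(&skipped, src);
	spin_unlock_irq(lock);

	return taken;
}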