On Tue, Jun 18, 2024 at 10:09:26AM +0800, zhaoyang.huang wrote:
> Hard lockup[2] is reported which should be caused by recursive
> lock contention of lruvec->lru_lock[1] within __split_huge_page.
>
> [1]
> static void __split_huge_page(struct page *page, struct list_head *list,
> 		pgoff_t end, unsigned int new_order)
> {
> 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> 	//1st lock here
> 	lruvec = folio_lruvec_lock(folio);
>
> 	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
> 		__split_huge_page_tail(folio, i, lruvec, list, new_order);
> 		/* Some pages can be beyond EOF: drop them from page cache */
> 		if (head[i].index >= end) {
> 			folio_put(tail);
> 				__page_cache_release
> 					//2nd lock here
> 					folio_lruvec_relock_irqsave

Why doesn't lockdep catch this?

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9859aa4f7553..ea504df46d3b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2925,7 +2925,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  				folio_account_cleaned(tail,
>  					inode_to_wb(folio->mapping->host));
>  			__filemap_remove_folio(tail, NULL);
> +			unlock_page_lruvec(lruvec);
>  			folio_put(tail);
> +			folio_lruvec_lock(folio);

Why is it safe to drop & reacquire this lock?  Is there nothing we need
to revalidate?
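
For illustration only, a minimal userspace sketch of the recursive-acquisition
pattern described in [1] (not kernel code; the lock name, release_tail_page()
and the trylock shortcut are stand-ins I made up): a non-recursive spinlock
taken a second time from the same context can never be acquired, which with
IRQs disabled in the kernel shows up as a hard lockup.

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lru_lock;	/* stand-in for lruvec->lru_lock */

/* analogue of folio_put() -> __page_cache_release() taking the lock again */
static void release_tail_page(void)
{
	/* trylock is used so this demo reports the problem instead of hanging */
	if (pthread_spin_trylock(&lru_lock) != 0)
		printf("second acquisition fails; a plain spin here would never return\n");
	else
		pthread_spin_unlock(&lru_lock);
}

int main(void)
{
	pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);

	/* analogue of the first folio_lruvec_lock() in __split_huge_page() */
	pthread_spin_lock(&lru_lock);
	release_tail_page();		/* inner path wants the same lock */
	pthread_spin_unlock(&lru_lock);

	pthread_spin_destroy(&lru_lock);
	return 0;
}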