Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

On 09/03/2024 08:05, Ryan Roberts wrote:
> On 09/03/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Mar 08, 2024 at 08:34:15PM -0800, Andrew Morton wrote:
>>>
>>> We seem to be coming down to the wire on this one - Linus might release
>>> 6.8 this weekend.
>>>
>>> Will simply dropping "mm: allow non-hugetlb large folios to be batch
>>> processed" from mm-stable get us out of trouble?
>>
>> We can add a fix patch which re-narrows the race to the point where it's
>> no longer observable.  Obviously we need to figure out what the real
>> problem is, but we could be going back a long way.  We've definitely
>> found two bugs in the process of investigating the problem (of arguable
>> import; the migration one merely wastes memory temporarily and it's not
>> entirely clear that the wrong-lock problem definitely causes a crash)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 6b697d33fa5b..7b1d3144391b 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>>  			free_huge_folio(folio);
>>  			continue;
>>  		}
>> +		if (folio_test_large(folio) && folio_test_large_rmappable(folio))
>> +			folio_undo_large_rmappable(folio);
>>  
>>  		__page_cache_release(folio, &lruvec, &flags);
>>  
> 
> I agree this is likely to re-hide the problems. But I haven't actually tested it
> on its own without the other fixes. I'll do some more testing with your latest
> patch and if that doesn't lead anywhere, I'll test with this on its own to check
> that I can no longer reproduce the crashes. If it hides them, I think this is
> the best short-term solution we have right now.

I've tested this workaround immediately on top of commit f77171d241e3 ("mm:
allow non-hugetlb large folios to be batch processed") and can't reproduce any
problem. I've run the test 32 times. Without the workaround, the most test
repeats I've managed before seeing a problem is ~5. So I'm confident this will
be sufficient as a short-term solution.
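
As a rough sanity check on that confidence, here is a minimal sketch of the
arithmetic. It assumes the runs are independent and that, without the
workaround, each run fails with a fixed probability of about 1 in 5. Both are
simplifying assumptions rather than measured facts, and prob_all_pass() is a
hypothetical helper name used only for illustration.

```c
#include <assert.h>

/*
 * Probability that `runs` independent test runs all pass when each run
 * fails with probability `p_fail`. Independence and a fixed per-run
 * failure rate are assumptions made for illustration only.
 */
double prob_all_pass(double p_fail, unsigned int runs)
{
	double p = 1.0;
	unsigned int i;

	assert(p_fail >= 0.0 && p_fail <= 1.0);

	/* Multiply in the per-run pass probability once per run. */
	for (i = 0; i < runs; i++)
		p *= 1.0 - p_fail;

	return p;
}
```

Under those assumptions, prob_all_pass(0.2, 32) comes out at roughly 0.0008, so
32 clean runs would be very unlikely if the workaround had no effect on the
failure rate.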



