On Fri, Oct 09, 2020 at 08:07:59PM -0700, Hugh Dickins wrote: > There have been elusive reports of filemap_fault() hitting its > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built > with CONFIG_READ_ONLY_THP_FOR_FS=y. > > Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and > CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged > without NUMA reuses the same huge page after collapse_file() failed > (whereas NUMA targets its allocation to the respective node each time). > And most of us were usually testing with CONFIG_NUMA=y kernels. Good catch. There have been at least three bugs in recent times which can cause this VM_BUG_ON_PAGE() to trigger. This one, one where swapping out a THP led to all 512 entries pointing to the same non-huge page on swapin (fixed in -mm) and one that I introduced for a few weeks in -mm where failing to split a THP would lead to random tree corruption due to a non-zeroed node being freed to the slab cache. There may yet be a fourth. I've seen it occasionally in recent testing so I'll add this patch and see if it disappears. > Instead, non-NUMA khugepaged_prealloc_page() release the old page > if anyone else has a reference to it (1% of cases when I tested). I think this is a good way to fix the problem. We could also change khugepaged to insert a frozen page, ensuring that find_get_entry() would spin until the collapse has succeeded or the page was removed from the cache again. But I have no problem with this approach. I want to note that this is a silent data corruption for reads. generic_file_buffered_read() has a reference to the page, so this patch will fix it, but before it could be copying the wrong data to userspace. Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>