Basically, the fault handler releases mmap_sem before requesting
readahead and is then supposed to retry the lookup of the page in the
page cache with FAULT_FLAG_TRIED so that it avoids the livelock of
infinite retry.

However, what happens if the fault handler finds a page in the page
cache and the page has the readahead marker but is waiting under
writeback? Add one more condition: it happens under mm_populate, which
repeats faulting unless it encounters an error. So let's assemble the
conditions below.

CPU 1                                          CPU 2

- first loop
mm_populate
 for ()
  ..
  ret = populate_vma_page_range
   __get_user_pages
    faultin_page
     handle_mm_fault
      filemap_fault
       do_async_mmap_readahead
        if (PageReadahead(pageA))
         maybe_unlock_mmap_for_io
         up_read(mmap_sem)
                                               shrink_page_list
                                                pageout
                                                 SetPageReclaim(=SetPageReadahead)(pageA)
                                                 writepage
                                                  SetPageWriteback(pageA)

         page_cache_async_readahead()
          ClearPageReadahead(pageA)
       do_async_mmap_readahead
      lock_page_maybe_drop_mmap
       goto out_retry

  pageA is reclaimed and a new pageB is populated at the file offset,
  which finally becomes PG_readahead

- second loop

   __get_user_pages
    faultin_page
     handle_mm_fault
      filemap_fault
       do_async_mmap_readahead
        if (PageReadahead(pageB))
         maybe_unlock_mmap_for_io
         up_read(mmap_sem)
                                               shrink_page_list
                                                pageout
                                                 SetPageReclaim(=SetPageReadahead)(pageB)
                                                 writepage
                                                  SetPageWriteback(pageB)

         page_cache_async_readahead()
          ClearPageReadahead(pageB)
       do_async_mmap_readahead
      lock_page_maybe_drop_mmap
       goto out_retry

It could be repeated forever, so it's a livelock. Without involving
reclaim, it could also happen if ra_pages becomes zero via fadvise from
another thread sharing the same fd (one thread doing random access
while the other is sequential), because page_cache_async_readahead has
the following early-return check (just like the PageWriteback one) and
ra_pages is never synchronized with fadvise and
shrink_readahead_size_eio from other threads.

page_cache_async_readahead(struct address_space *mapping,
                           unsigned long req_size)
{
        /* no read-ahead */
        if (!ra->ra_pages)
                return;

Thus, we need to limit fault retries from mm_populate like the page
fault handler does.

Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Reviewed-by: Jan Kara <jack@xxxxxxx>
Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
 mm/gup.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..6f6548c63ad5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1133,7 +1133,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
  *
  * This takes care of mlocking the pages too if VM_LOCKED is set.
  *
- * return 0 on success, negative error code on error.
+ * return number of pages pinned on success, negative error code on error.
  *
  * vma->vm_mm->mmap_sem must be held.
  *
@@ -1196,6 +1196,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
 	long ret = 0;
+	bool tried = false;
 
 	end = start + len;
 
@@ -1226,14 +1227,18 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		 * double checks the vma flags, so that it won't mlock pages
 		 * if the vma was already munlocked.
 		 */
-		ret = populate_vma_page_range(vma, nstart, nend, &locked);
+		ret = populate_vma_page_range(vma, nstart, nend,
+					tried ? NULL : &locked);
 		if (ret < 0) {
 			if (ignore_errors) {
 				ret = 0;
 				continue;	/* continue at next VMA */
 			}
 			break;
-		}
+		} else if (ret == 0)
+			tried = true;
+		else
+			tried = false;
 		nend = nstart + ret * PAGE_SIZE;
 		ret = 0;
 	}
-- 
2.25.0.265.gbab2e86ba0-goog
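
To make the effect of the new "tried" flag easier to see outside the
kernel, here is a minimal user-space sketch of the retry-limiting idea.
populate_attempt() is a hypothetical stand-in for
populate_vma_page_range(): when it is allowed to drop the lock it may
make no progress (return 0), and passing NULL for the retry pointer
plays the role of FAULT_FLAG_TRIED, forcing the next attempt to either
make progress or fail instead of retrying forever.

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for populate_vma_page_range(): returns the number
 * of pages "populated", 0 when it dropped the lock without making
 * progress, or a negative error.  A NULL allow_retry means the caller
 * already retried once (think FAULT_FLAG_TRIED), so the lock may not be
 * dropped again.
 */
static long populate_attempt(unsigned long nr_pages, bool *allow_retry)
{
	static int drops;

	if (allow_retry && drops++ < 3) {
		*allow_retry = false;	/* lock was dropped, no progress */
		return 0;
	}
	return (long)nr_pages;		/* populated everything */
}

int main(void)
{
	unsigned long nr_pages = 8;
	bool locked = true;
	bool tried = false;

	while (nr_pages) {
		/* Mirrors the patch: after a zero-progress attempt, pass
		 * NULL so the next attempt cannot drop the lock again. */
		long ret = populate_attempt(nr_pages, tried ? NULL : &locked);

		if (ret < 0)
			return 1;	/* hard error, give up */
		tried = (ret == 0);	/* limit the next retry */
		nr_pages -= (unsigned long)ret;
	}
	printf("populated all pages without livelocking\n");
	return 0;
}

Without the "tried ? NULL : &locked" step, the loop above could spin on
ret == 0 indefinitely, which is exactly the livelock described in the
changelog.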