On Sun, Mar 24, 2024 at 07:14:27PM +0800, Zhaoyang Huang wrote: > ok. It seems like madvise is robust enough to leave no BUGs. I presume > another two scenarios which call folio_isloate_lru by any other ways > but PTE. Besides, scenario 2 reminds me of a previous bug reported by > me as find_get_entry entered in a livelock where the folio's refcnt == > 0 but remains at xarray which causes the reset->retry loops forever. I > would like to reply in that thread for more details. > > Scenario 1: > 0. Thread_bad gets the folio by find_get_entry and preempted before > folio_lock (could be the second round scan of > truncate_inode_pages_range) > refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false > folio = find_get_entry > folio_try_get_rcu > <preempted> > folio_try_lock > > 1. Thread_truncate get the folio via > truncate_inode_pages_range->find_lock_entries > refcnt == 3(page_cache, fbatch_bad, fbatch_truncate), PG_lru == > true, PG_lock == true Hang on, you can't have two threads in truncate_inode_pages_range() at the same time. I appreciate that we don't have any documentation of that, but if it were possible, we'd see other crashes. Removing the folio from the page cache sets folio->mapping to NULL. And __filemap_remove_folio() uses folio->mapping in filemap_unaccount_folio() and page_cache_delete(), so we'd get NULL pointer dereferences. I see a hint in the DAX code that it's an fs-dependent lock: /* * This gets called from truncate / punch_hole path. As such, the caller * must hold locks protecting against concurrent modifications of the * page cache (usually fs-private i_mmap_sem for writing). Since the * caller has seen a DAX entry for this index, we better find it * at that index as well... */ so maybe that's why there's no lockdep_assert() in truncate_inode_pages_range(), but there should be a comment. > Scenario 2: > 0. Thread_bad gets the folio by find_get_entry and preempted before > folio_lock (could be the second round scan of > truncate_inode_pages_range) > refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false > folio = find_get_entry > folio_try_get_rcu > <preempted> > folio_try_lock > > 1. Thread_readahead remove the folio from page cache and drop one > refcnt by filemap_remove_folio(get rid of the folios which failed to > launch IO during readahead) > refcnt == 1(fbatch_bad), PG_lru == true, PG_lock == true So readaahead inserts the folio locked, and then calls filemap_remove_folio() without having unlocked it. filemap_remove_folio() sets folio->mapping to NULL in page_cache_delete(). When "Thread_bad" wakes up, it gets the folio lock, calls truncate_inode_folio() and sees that folio->mapping != mapping, so it doesn't call filemap_remove_folio(). > 4. Thread_bad schedule back from step 0 and clear one refcnt wrongly > when doing truncate_inode_folio->filemap_remove_folio as it take this > refcnt as the page cache one > refcnt == 1'(thread_isolate), PG_lru == false, PG_lock == false > find_get_entries > folio = find_get_entry > folio_try_get_rcu > folio_lock > <no check as folio->mapping != mapping as folio_lock_entries does> > truncate_inode_folio > filemap_remove_folio > <preempted>