On Fri, Mar 22, 2024 at 11:20 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 22, 2024 at 09:52:36AM +0800, Zhaoyang Huang wrote:
> > Thanks for the comments. I fixed the typo and updated the timing
> > sequence, amending the possible preemption points so that the refcnt
> > makes sense.
> >
> > 0. Thread_bad gets the folio by find_get_entry and is preempted before
> > taking the refcnt (could be the second-round scan of
> > truncate_inode_pages_range)
> > refcnt == 1 (page_cache), PG_lru == true, PG_lock == false
> >     find_get_entry
> >         folio = xas_find
> >         <preempted>
> >         folio_try_get_rcu
> >
> > 1. Thread_filemap gets the folio via
> > filemap_map_pages->next_uptodate_folio->xas_next_entry and is preempted
> > refcnt == 1 (page_cache), PG_lru == true, PG_lock == false
> >     filemap_map_pages
> >         next_uptodate_folio
> >             xas_next_entry
> >             <preempted>
> >             folio_try_get_rcu
> >
> > 2. Thread_truncate gets the folio via
> > truncate_inode_pages_range->find_lock_entries
> > refcnt == 2 (page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 3. Thread_truncate proceeds to truncate_cleanup_folio
> > refcnt == 2 (page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 4. Thread_truncate proceeds to delete_from_page_cache_batch
> > refcnt == 1 (fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 4.1 folio_unlock
> > refcnt == 1 (fbatch_truncate), PG_lru == true, PG_lock == false
>
> OK, so by the time we get to folio_unlock(), the folio has been removed
> from the i_pages xarray.
>
> > 5. Thread_filemap schedules back from '1' and proceeds to set up a pte,
> > giving folio->_mapcnt = 0 & folio->refcnt += 1
> > refcnt == 1->2 (+fbatch_filemap)->3->2 (pte, fbatch_truncate),
> > PG_lru == true, PG_lock == true->false
>
> This line succeeds (in next_uptodate_folio):
>
>         if (!folio_try_get_rcu(folio))
>                 continue;
>
> but then this fails:
>
>         if (unlikely(folio != xas_reload(xas)))
>                 goto skip;
> skip:
>         folio_put(folio);
>
> because xas_reload() will return NULL due to the folio being deleted
> in step 4.  So we never get to the point where we set up a PTE.
>
> There should be no way to create a new PTE for a folio which has been
> removed from the page cache.  Bugs happen, of course, but I don't see
> one yet.
>
> > 6. Thread_madv clears the folio's PG_lru by
> > madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
> > refcnt == 2 (pte, fbatch_truncate), PG_lru == false, PG_lock == false
> >
> > 7. Thread_truncate calls folio_fbatch_release and fails to free the
> > folio, as the refcnt has not reached 0
> > refcnt == 1 (pte), PG_lru == false, PG_lock == false
> > ******** The folio becomes an orphan here: it is no longer in the page
> > cache but is still mapped in the task's VM. ********
> >
> > 8. Thread_bad schedules back from '0' and the folio is collected in
> > fbatch_bad
> > refcnt == 2 (pte, fbatch_bad), PG_lru == false, PG_lock == true
> >
> > 9. Thread_bad wrongly drops one refcnt when doing filemap_remove_folio,
> > as it takes this refcnt to be the page cache one
> > refcnt == 1 (fbatch_bad), PG_lru == false, PG_lock == true->false
> >     truncate_inode_folio
> >         filemap_remove_folio
> >             filemap_free_folio
> > ****** The refcnt is wrongly decreased here by being taken as the page
> > cache one. ******
> >
> > 10. Thread_bad calls release_pages(fbatch_bad) and has the folio
> > introduce the bug.
> >     release_pages
> >         folio_put_testzero == true
> >         folio_test_lru == false
> >         list_add(folio->lru, pages_to_free)

OK. It seems like madvise is robust enough to leave no BUGs.
I presume there are another two scenarios, which reach folio_isolate_lru
by means other than a PTE. Besides, scenario 2 reminds me of a previous
bug I reported, where find_get_entry entered a livelock: the folio's
refcnt == 0 but it remained in the xarray, which made the reset->retry
loop spin forever. I would like to reply in that thread with more
details.

Scenario 1:

0. Thread_bad gets the folio by find_get_entry and is preempted before
folio_lock (could be the second-round scan of truncate_inode_pages_range)
refcnt == 2 (page_cache, fbatch_bad), PG_lru == true, PG_lock == false
    folio = find_get_entry
        folio_try_get_rcu
    <preempted>
    folio_try_lock

1. Thread_truncate gets the folio via
truncate_inode_pages_range->find_lock_entries
refcnt == 3 (page_cache, fbatch_bad, fbatch_truncate), PG_lru == true,
PG_lock == true

2. Thread_truncate proceeds to truncate_cleanup_folio
refcnt == 3 (page_cache, fbatch_bad, fbatch_truncate), PG_lru == true,
PG_lock == true

3. Thread_truncate proceeds to delete_from_page_cache_batch
refcnt == 2 (fbatch_bad, fbatch_truncate), PG_lru == true, PG_lock == true

4. folio_unlock
refcnt == 2 (fbatch_bad, fbatch_truncate), PG_lru == true, PG_lock == false

5. Thread_bad schedules back from step 0 and wrongly drops one refcnt
when doing truncate_inode_folio->filemap_remove_folio, as it takes this
refcnt to be the page cache one
refcnt == 1' (fbatch_truncate), PG_lru == false, PG_lock == true
    folio = find_get_entry
        folio_try_get_rcu
    folio_try_lock
    truncate_inode_folio
        filemap_remove_folio
    <preempted>

6. Thread_isolate takes one refcnt and calls folio_isolate_lru (could be
any process)
refcnt == 2' (fbatch_truncate, thread_isolate), PG_lru == true,
PG_lock == true

7. Thread_isolate proceeds to clear PG_lru and is preempted before
folio_get
refcnt == 2' (fbatch_truncate, thread_isolate), PG_lru == false,
PG_lock == true
    folio_test_clear_lru
    <preempted>
    folio_get

8. Thread_bad schedules back from step 5 and proceeds to drop one refcnt
refcnt == 1' (thread_isolate), PG_lru == false, PG_lock == true
    folio = find_get_entry
        folio_try_get_rcu
    folio_try_lock
    truncate_inode_folio
        filemap_remove_folio
    folio_unlock
    <preempted>

9. Thread_truncate schedules back from step 3 and proceeds to drop one
refcnt by release_pages, hitting the BUG
refcnt == 0' (thread_isolate), PG_lru == false, PG_lock == false

Scenario 2:

0. Thread_bad gets the folio by find_get_entry and is preempted before
folio_lock (could be the second-round scan of truncate_inode_pages_range)
refcnt == 2 (page_cache, fbatch_bad), PG_lru == true, PG_lock == false
    folio = find_get_entry
        folio_try_get_rcu
    <preempted>
    folio_try_lock

1. Thread_readahead removes the folio from the page cache and drops one
refcnt by filemap_remove_folio (getting rid of the folios which failed
to launch IO during readahead)
refcnt == 1 (fbatch_bad), PG_lru == true, PG_lock == true

2. folio_unlock
refcnt == 1 (fbatch_bad), PG_lru == true, PG_lock == false

3. Thread_isolate takes one refcnt and calls folio_isolate_lru (could be
any process)
refcnt == 2 (fbatch_bad, thread_isolate), PG_lru == true, PG_lock == false

4. Thread_bad schedules back from step 0 and wrongly drops one refcnt
when doing truncate_inode_folio->filemap_remove_folio, as it takes this
refcnt to be the page cache one
refcnt == 1' (thread_isolate), PG_lru == false, PG_lock == false
    find_get_entries
        folio = find_get_entry
            folio_try_get_rcu
    folio_lock
    <no folio->mapping != mapping check here, unlike find_lock_entries>
    truncate_inode_folio
        filemap_remove_folio
    <preempted>

5. Thread_isolate proceeds to clear PG_lru and is preempted before
folio_get
refcnt == 1' (thread_isolate), PG_lru == false, PG_lock == false
    folio_test_clear_lru
    <preempted>
    folio_get

6. Thread_bad schedules back from step 4 and proceeds to drop one refcnt
by release_pages, hitting the BUG
refcnt == 0' (thread_isolate), PG_lru == false, PG_lock == false