On Wed, Mar 27, 2024 at 8:31 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Wed, Mar 27, 2024 at 09:25:59AM +0800, Zhaoyang Huang wrote:
> > > Ignoring any other thread, you're basically saying that there's a
> > > refcount imbalance here.  Which means we'd hit an assert (that folio
> > > refcount went below zero) in the normal case where another thread wasn't
> > > simultaneously trying to do anything.
> > Theoretically Yes but it is rare in practice as aops->readahead will
> > launch all pages to IO under most scenarios.
>
> Rare, but this path has been tested.
>
> > read_pages
> >     aops->readahead[1]
> >     ...
> >     while (folio = readahead_folio)[2]
> >         filemap_remove_folio
> >
> > IMO, according to the comments of readahead_page, the refcnt
> > represents page cache dropped in [1] makes sense for two reasons, '1.
> > The folio is going to do IO and is locked until IO done; 2. The refcnt
> > will be added back when found again from the page cache and then serve
> > for PTE or vfs' while it doesn't make sense in [2] as the refcnt of
> > page cache will be dropped in filemap_remove_folio
> >
> >  * Context: The page is locked and has an elevated refcount.  The caller
> >  * should decreases the refcount once the page has been submitted for I/O
> >  * and unlock the page once all I/O to that page has completed.
> >  * Return: A pointer to the next page, or %NULL if we are done.
>
> Follow the refcount through.
>
> In page_cache_ra_unbounded():
>
>         folio = filemap_alloc_folio(gfp_mask, 0);
>         (folio has refcount 1)
>         ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
>         (folio has refcount 2)
>
> Then we call read_pages()
> First we call ->readahead() which for some reason stops early.
> Then we call readahead_folio() which calls folio_put()
> (folio has refcount 1)
> Then we call folio_get()
> (folio has refcount 2)
> Then we call filemap_remove_folio()
> (folio has refcount 1)
> Then we call folio_unlock()
> Then we call folio_put()

OK, I missed the refcount taken by alloc_pages. However, I still think it
is a bug to call readahead_folio() in read_pages(): the refcount obtained
by alloc_pages should be the folio's final guard, paired with the one
checked in shrink_folio_list()->__remove_mapping()->folio_ref_freeze(2)
(the 2 there stands for alloc_pages plus the page cache). If we drop that
reference without first isolating the folio from the LRU, the race below
can happen. Furthermore, since the refcount dropped in readahead_folio()
represents the page cache reference, it makes no sense to drop it a second
time in read_pages().

0. Thread_readahead:
   folio_put()
       folio_put_testzero() == true
       __folio_put()
           folio_test_lru() == true
           <preempted>

1. Thread_isolate:
   folio_isolate_lru()
       folio_test_clear_lru()
       lruvec_del_folio()

2. Thread_readahead:
   folio_put()
       folio_put_testzero() == true
       __folio_put()
           folio_test_lru() == true
           <scheduled back in from 0>
           lruvec_del_folio()

> (folio has refcount 0 and is freed)
>
> Yes, other things can happen in there to increment the refcount, so this
> folio_put() might not be the last put, but we hold the folio locked the
> entire time, so many things which might be attempted will block on the
> folio lock.  In particular, nobody can remove it from the page cache,
> so its refcount cannot reach 0 until the last folio_put() of the
> sequence.
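
For completeness, here is a minimal userspace sketch replaying the refcount
arithmetic from the walkthrough above. This is not kernel code: the struct
and the folio_get()/folio_put() helpers are illustrative stand-ins for the
real folio API, and the comments map each step back to the sequence Matthew
describes. It only demonstrates that the count reaches zero at the final
put, not earlier.

    /* Compile with: cc -o refcount refcount.c */
    #include <assert.h>
    #include <stdio.h>

    /* Stand-in for struct folio; only the refcount matters here. */
    struct folio { int refcount; };

    static void folio_get(struct folio *f) { f->refcount++; }
    static void folio_put(struct folio *f) { f->refcount--; }

    int main(void)
    {
        struct folio f = { .refcount = 1 }; /* filemap_alloc_folio() */

        folio_get(&f);  /* filemap_add_folio() takes the page cache ref */
        assert(f.refcount == 2);

        /* ->readahead() stopped early; read_pages() now cleans up: */
        folio_put(&f);  /* readahead_folio() drops the caller's ref */
        folio_get(&f);  /* pin the folio across removal */
        folio_put(&f);  /* filemap_remove_folio() drops the cache ref */
        /* folio_unlock() happens here in the real code */
        folio_put(&f);  /* final put */

        assert(f.refcount == 0); /* freed only at the very end */
        printf("refcount after cleanup: %d\n", f.refcount);
        return 0;
    }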