On Thu, Nov 07, 2024 at 03:56:10PM -0800, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
>    writeback and this folio has both the writeback and reclaim flags
>    set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
>    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
>    if reclaim is on the root cgroup), or if the folio is not marked for
>    immediate reclaim, or if the caller does not have __GFP_FS (or
>    __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
>    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
>
> In cases 1) and 2), we activate the folio and skip reclaiming it, while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3), we wait on writeback because
> cgroupv1 does not have dirty folio throttling; as such, this is a
> mitigation against the case where there are too many folios in writeback
> with nothing else to reclaim.
>
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds may need to be put in place to avoid a potential deadlock
> that may arise from reclaim waiting on writeback. (Even though case 3
> above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for.)
> For example, for
> FUSE filesystems, a temp page gets allocated per dirty page and the
> contents of the dirty page are copied over to the temp page so that
> writeback can be immediately cleared on the dirty page in order to avoid
> the following deadlock:
> * single-threaded FUSE server is in the middle of handling a request that
>   needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
>   above) that needs to be written back to the FUSE server
> * the FUSE server can't write back the folio since it's stuck in direct
>   reclaim
>
> In this commit, if legacy memcg encounters a folio with the reclaim flag
> set (eg case 3) and the folio belongs to a mapping that has the
> AS_WRITEBACK_MAY_BLOCK flag set, the folio will be activated and skip
> reclaim (eg default to behavior in case 2) instead.
>
> This allows for the suboptimal workarounds added to address the
> "reclaim wait on writeback" deadlock scenario to be removed.
>
> Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>

This looks good, just one nit below.

Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>

> ---
>  mm/vmscan.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..e9755cb7211b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	if (writeback && folio_test_reclaim(folio))
>  		stat->nr_congested += nr_pages;
>
> +	mapping = folio_mapping(folio);

Move the above line within the folio_test_writeback() check block.

> +
>  	/*
>  	 * If a folio at the tail of the LRU is under writeback, there
>  	 * are three cases to consider.
> @@ -1129,8 +1131,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	 * 2) Global or new memcg reclaim encounters a folio that is
>  	 *    not marked for immediate reclaim, or the caller does not
>  	 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
> -	 *    not to fs). In this case mark the folio for immediate
> -	 *    reclaim and continue scanning.
> +	 *    not to fs), or writebacks in the mapping may block.
> +	 *    In this case mark the folio for immediate reclaim and
> +	 *    continue scanning.
>  	 *
>  	 * Require may_enter_fs() because we would wait on fs, which
>  	 * may not have submitted I/O yet. And the loop driver might
> @@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		/* Case 2 above */
>  	} else if (writeback_throttling_sane(sc) ||
>  		   !folio_test_reclaim(folio) ||
> -		   !may_enter_fs(folio, sc->gfp_mask)) {
> +		   !may_enter_fs(folio, sc->gfp_mask) ||
> +		   (mapping && mapping_writeback_may_block(mapping))) {
>  		/*
>  		 * This is slightly racy -
>  		 * folio_end_writeback() might have
> --
> 2.43.5
>