On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote: > Currently in shrink_folio_list(), reclaim for folios under writeback > falls into 3 different cases: > 1) Reclaim is encountering an excessive number of folios under > writeback and this folio has both the writeback and reclaim flags > set > 2) Dirty throttling is enabled (this happens if reclaim through cgroup > is not enabled, if reclaim through cgroupv2 memcg is enabled, or > if reclaim is on the root cgroup), or if the folio is not marked for > immediate reclaim, or if the caller does not have __GFP_FS (or > __GFP_IO if it's going to swap) set > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag > set and the caller did not have __GFP_FS (or __GFP_IO if swap) set > > In cases 1) and 2), we activate the folio and skip reclaiming it while > in case 3), we wait for writeback to finish on the folio and then try > to reclaim the folio again. In case 3, we wait on writeback because > cgroupv1 does not have dirty folio throttling, as such this is a > mitigation against the case where there are too many folios in writeback > with nothing else to reclaim. > > The issue is that for filesystems where writeback may block, sub-optimal > workarounds need to be put in place to avoid potential deadlocks that may > arise from the case where reclaim waits on writeback. (Even though case > 3 above is rare given that legacy cgroupv1 is on its way to being > deprecated, this case still needs to be accounted for) > > For example, for FUSE filesystems, when a writeback is triggered on a > folio, a temporary folio is allocated and the pages are copied over to > this temporary folio so that writeback can be immediately cleared on the > original folio. This additionally requires an internal rb tree to keep > track of writeback state on the temporary folios. Benchmarks show > roughly a ~20% decrease in throughput from the overhead incurred with 4k > block size writes. The temporary folio is needed here in order to avoid > the following deadlock if reclaim waits on writeback: > * single-threaded FUSE server is in the middle of handling a request that > needs a memory allocation > * memory allocation triggers direct reclaim > * direct reclaim waits on a folio under writeback (eg falls into case 3 > above) that needs to be written back to the fuse server > * the FUSE server can't write back the folio since it's stuck in direct > reclaim > > This commit allows filesystems to set a ASOP_NO_RECLAIM_IN_WRITEBACK > flag in the address_space_operations struct to signify that reclaim > should not happen when the folio is already in writeback. This only has > effects on the case where cgroupv1 memcg encounters a folio under > writeback that already has the reclaim flag set (eg case 3 above), and > allows for the suboptimal workarounds added to address the "reclaim wait > on writeback" deadlock scenario to be removed. > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > --- > include/linux/fs.h | 14 ++++++++++++++ > mm/vmscan.c | 6 ++++-- > 2 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e3c603d01337..808164e3dd84 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb) > return kiocb->ki_complete == NULL; > } > > +typedef unsigned int __bitwise asop_flags_t; > + > struct address_space_operations { > + asop_flags_t asop_flags; > int (*writepage)(struct page *page, struct writeback_control *wbc); > int (*read_folio)(struct file *, struct folio *); > > @@ -438,6 +441,12 @@ struct address_space_operations { > int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); > }; > > +/** > + * This flag is only to be used by filesystems whose folios cannot be > + * reclaimed when in writeback (eg fuse) > + */ > +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0)) > + > extern const struct address_space_operations empty_aops; > > /** > @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) > atomic_inc(&mapping->i_mmap_writable); > } > > +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping) > +{ > + return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK; Any reason not to add this flag in enum mapping_flags and use mapping->flags field instead of adding a field in struct address_space_operations? > +} > + > /* > * Use sequence counter to get consistent i_size on 32-bit processors. > */ > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 749cdc110c74..2beffbdae572 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > if (writeback && folio_test_reclaim(folio)) > stat->nr_congested += nr_pages; > > + mapping = folio_mapping(folio); > + > /* > * If a folio at the tail of the LRU is under writeback, there > * are three cases to consider. > @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > /* Case 2 above */ > } else if (writeback_throttling_sane(sc) || > !folio_test_reclaim(folio) || > - !may_enter_fs(folio, sc->gfp_mask)) { > + !may_enter_fs(folio, sc->gfp_mask) || > + (mapping && mapping_no_reclaim_in_writeback(mapping))) { > /* > * This is slightly racy - > * folio_end_writeback() might have > @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > if (folio_maybe_dma_pinned(folio)) > goto activate_locked; > > - mapping = folio_mapping(folio); > if (folio_test_dirty(folio)) { > /* > * Only kswapd can writeback filesystem folios > -- > 2.43.5 >