On Sat, Oct 12, 2024 at 9:55 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote: > > On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote: > > Currently in shrink_folio_list(), reclaim for folios under writeback > > falls into 3 different cases: > > 1) Reclaim is encountering an excessive number of folios under > > writeback and this folio has both the writeback and reclaim flags > > set > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup > > is not enabled, if reclaim through cgroupv2 memcg is enabled, or > > if reclaim is on the root cgroup), or if the folio is not marked for > > immediate reclaim, or if the caller does not have __GFP_FS (or > > __GFP_IO if it's going to swap) set > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag > > set and the caller did not have __GFP_FS (or __GFP_IO if swap) set > > > > In cases 1) and 2), we activate the folio and skip reclaiming it while > > in case 3), we wait for writeback to finish on the folio and then try > > to reclaim the folio again. In case 3, we wait on writeback because > > cgroupv1 does not have dirty folio throttling, as such this is a > > mitigation against the case where there are too many folios in writeback > > with nothing else to reclaim. > > > > The issue is that for filesystems where writeback may block, sub-optimal > > workarounds need to be put in place to avoid potential deadlocks that may > > arise from the case where reclaim waits on writeback. (Even though case > > 3 above is rare given that legacy cgroupv1 is on its way to being > > deprecated, this case still needs to be accounted for) > > > > For example, for FUSE filesystems, when a writeback is triggered on a > > folio, a temporary folio is allocated and the pages are copied over to > > this temporary folio so that writeback can be immediately cleared on the > > original folio. This additionally requires an internal rb tree to keep > > track of writeback state on the temporary folios. Benchmarks show > > roughly a ~20% decrease in throughput from the overhead incurred with 4k > > block size writes. The temporary folio is needed here in order to avoid > > the following deadlock if reclaim waits on writeback: > > * single-threaded FUSE server is in the middle of handling a request that > > needs a memory allocation > > * memory allocation triggers direct reclaim > > * direct reclaim waits on a folio under writeback (eg falls into case 3 > > above) that needs to be written back to the fuse server > > * the FUSE server can't write back the folio since it's stuck in direct > > reclaim > > > > This commit allows filesystems to set a ASOP_NO_RECLAIM_IN_WRITEBACK > > flag in the address_space_operations struct to signify that reclaim > > should not happen when the folio is already in writeback. This only has > > effects on the case where cgroupv1 memcg encounters a folio under > > writeback that already has the reclaim flag set (eg case 3 above), and > > allows for the suboptimal workarounds added to address the "reclaim wait > > on writeback" deadlock scenario to be removed. > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > --- > > include/linux/fs.h | 14 ++++++++++++++ > > mm/vmscan.c | 6 ++++-- > > 2 files changed, 18 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index e3c603d01337..808164e3dd84 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb) > > return kiocb->ki_complete == NULL; > > } > > > > +typedef unsigned int __bitwise asop_flags_t; > > + > > struct address_space_operations { > > + asop_flags_t asop_flags; > > int (*writepage)(struct page *page, struct writeback_control *wbc); > > int (*read_folio)(struct file *, struct folio *); > > > > @@ -438,6 +441,12 @@ struct address_space_operations { > > int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); > > }; > > > > +/** > > + * This flag is only to be used by filesystems whose folios cannot be > > + * reclaimed when in writeback (eg fuse) > > + */ > > +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0)) > > + > > extern const struct address_space_operations empty_aops; > > > > /** > > @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) > > atomic_inc(&mapping->i_mmap_writable); > > } > > > > +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping) > > +{ > > + return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK; > > Any reason not to add this flag in enum mapping_flags and use > mapping->flags field instead of adding a field in struct > address_space_operations? No, thanks for the suggestion - I really like your idea of adding this to enum mapping_flags instead as AS_NO_WRITEBACK_RECLAIM. I don't know why I didn't see mapping_flags when I was looking at this. I'll make this change for v2. Thanks, Joanne > > > +} > > + > > /* > > * Use sequence counter to get consistent i_size on 32-bit processors. > > */ > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 749cdc110c74..2beffbdae572 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (writeback && folio_test_reclaim(folio)) > > stat->nr_congested += nr_pages; > > > > + mapping = folio_mapping(folio); > > + > > /* > > * If a folio at the tail of the LRU is under writeback, there > > * are three cases to consider. > > @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > /* Case 2 above */ > > } else if (writeback_throttling_sane(sc) || > > !folio_test_reclaim(folio) || > > - !may_enter_fs(folio, sc->gfp_mask)) { > > + !may_enter_fs(folio, sc->gfp_mask) || > > + (mapping && mapping_no_reclaim_in_writeback(mapping))) { > > /* > > * This is slightly racy - > > * folio_end_writeback() might have > > @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (folio_maybe_dma_pinned(folio)) > > goto activate_locked; > > > > - mapping = folio_mapping(folio); > > if (folio_test_dirty(folio)) { > > /* > > * Only kswapd can writeback filesystem folios > > -- > > 2.43.5 > >