Re: [PATCH 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock

Joanne Koong <joannelkoong@xxxxxxxxx> · Mon, 14 Oct 2024 10:18:05 -0700

On Sat, Oct 12, 2024 at 9:55 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote:
> > Currently in shrink_folio_list(), reclaim for folios under writeback
> > falls into 3 different cases:
> > 1) Reclaim is encountering an excessive number of folios under
> >    writeback and this folio has both the writeback and reclaim flags
> >    set
> > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> >    if reclaim is on the root cgroup), or if the folio is not marked for
> >    immediate reclaim, or if the caller does not have __GFP_FS (or
> >    __GFP_IO if it's going to swap) set
> > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> >
> > In cases 1) and 2), we activate the folio and skip reclaiming it while
> > in case 3), we wait for writeback to finish on the folio and then try
> > to reclaim the folio again. In case 3, we wait on writeback because
> > cgroupv1 does not have dirty folio throttling, as such this is a
> > mitigation against the case where there are too many folios in writeback
> > with nothing else to reclaim.
> >
> > The issue is that for filesystems where writeback may block, sub-optimal
> > workarounds need to be put in place to avoid potential deadlocks that may
> > arise from the case where reclaim waits on writeback. (Even though case
> > 3 above is rare given that legacy cgroupv1 is on its way to being
> > deprecated, this case still needs to be accounted for)
> >
> > For example, for FUSE filesystems, when a writeback is triggered on a
> > folio, a temporary folio is allocated and the pages are copied over to
> > this temporary folio so that writeback can be immediately cleared on the
> > original folio. This additionally requires an internal rb tree to keep
> > track of writeback state on the temporary folios. Benchmarks show
> > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > block size writes. The temporary folio is needed here in order to avoid
> > the following deadlock if reclaim waits on writeback:
> > * single-threaded FUSE server is in the middle of handling a request that
> >   needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback (eg falls into case 3
> >   above) that needs to be written back to the fuse server
> > * the FUSE server can't write back the folio since it's stuck in direct
> >   reclaim
> >
> > This commit allows filesystems to set a ASOP_NO_RECLAIM_IN_WRITEBACK
> > flag in the address_space_operations struct to signify that reclaim
> > should not happen when the folio is already in writeback. This only has
> > effects on the case where cgroupv1 memcg encounters a folio under
> > writeback that already has the reclaim flag set (eg case 3 above), and
> > allows for the suboptimal workarounds added to address the "reclaim wait
> > on writeback" deadlock scenario to be removed.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> > ---
> >  include/linux/fs.h | 14 ++++++++++++++
> >  mm/vmscan.c        |  6 ++++--
> >  2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e3c603d01337..808164e3dd84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
> >       return kiocb->ki_complete == NULL;
> >  }
> >
> > +typedef unsigned int __bitwise asop_flags_t;
> > +
> >  struct address_space_operations {
> > +     asop_flags_t asop_flags;
> >       int (*writepage)(struct page *page, struct writeback_control *wbc);
> >       int (*read_folio)(struct file *, struct folio *);
> >
> > @@ -438,6 +441,12 @@ struct address_space_operations {
> >       int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
> >  };
> >
> > +/**
> > + * This flag is only to be used by filesystems whose folios cannot be
> > + * reclaimed when in writeback (eg fuse)
> > + */
> > +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0))
> > +
> >  extern const struct address_space_operations empty_aops;
> >
> >  /**
> > @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
> >       atomic_inc(&mapping->i_mmap_writable);
> >  }
> >
> > +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping)
> > +{
> > +     return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK;
>
> Any reason not to add this flag in enum mapping_flags and use
> mapping->flags field instead of adding a field in struct
> address_space_operations?

No, thanks for the suggestion - I really like your idea of adding this
to enum mapping_flags instead as AS_NO_WRITEBACK_RECLAIM. I don't know
why I didn't see mapping_flags when I was looking at this. I'll make
this change for v2.

Thanks,
Joanne

>
> > +}
> > +
> >  /*
> >   * Use sequence counter to get consistent i_size on 32-bit processors.
> >   */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..2beffbdae572 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if (writeback && folio_test_reclaim(folio))
> >                       stat->nr_congested += nr_pages;
> >
> > +             mapping = folio_mapping(folio);
> > +
> >               /*
> >                * If a folio at the tail of the LRU is under writeback, there
> >                * are three cases to consider.
> > @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                       /* Case 2 above */
> >                       } else if (writeback_throttling_sane(sc) ||
> >                           !folio_test_reclaim(folio) ||
> > -                         !may_enter_fs(folio, sc->gfp_mask)) {
> > +                         !may_enter_fs(folio, sc->gfp_mask) ||
> > +                         (mapping && mapping_no_reclaim_in_writeback(mapping))) {
> >                               /*
> >                                * This is slightly racy -
> >                                * folio_end_writeback() might have
> > @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               if (folio_maybe_dma_pinned(folio))
> >                       goto activate_locked;
> >
> > -             mapping = folio_mapping(folio);
> >               if (folio_test_dirty(folio)) {
> >                       /*
> >                        * Only kswapd can writeback filesystem folios
> > --
> > 2.43.5
> >