Re: [PATCH 1/2] mm: Add memalloc_nowait_{save,restore}

On Thu, Aug 15, 2024 at 10:54 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 14, 2024 at 03:32:26PM +0800, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> > > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > > reclamation.
> > > > >
> > > > > Readahead already uses this context:
> > > > >
> > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > > > > {
> > > > >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > > > > }
> > > > >
> > > > > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > > > > Most filesystems already have GFP_NOFS context from
> > > > > mapping_gfp_mask(), so how much difference does completely avoiding
> > > > > direct reclaim actually make under memory pressure?
> > > >
> > > > Besides __GFP_NOFS, ~__GFP_DIRECT_RECLAIM also implies
> > > > __GFP_NOIO. If we don't set __GFP_NOIO, readahead can wait for IO,
> > > > right?
> > >
> > > There's a *lot* more difference between __GFP_NORETRY and
> > > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
> > > describe to me what the differences are; what I'm asking you is this:
> > >
> > > > > i.e. doing some direct reclaim without blocking when under memory
> > > > > pressure might actually give better performance than skipping direct
> > > > > reclaim and aborting readahead altogether....
> > > > >
> > > > > This really, really needs some numbers (both throughput and IO
> > > > > latency histograms) to go with it because we have no evidence either
> > > > > way to determine what is the best approach here.
> > >
> > > Put simply: does the existing readahead mechanism give better results
> > > than the proposed one, and if so, why wouldn't we just reenable
> > > readahead unconditionally instead of making it behave differently
> > > for this specific case?
> >
> > Are you suggesting we compare the following change with the current proposal?
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index fd34b5755c0b..ced74b1b350d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
> >         if (flags & RWF_NOWAIT) {
> >                 if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
> >                         return -EOPNOTSUPP;
> > -               kiocb_flags |= IOCB_NOIO;
> >         }
> >         if (flags & RWF_ATOMIC) {
> >                 if (rw_type != WRITE)
>
> Yes.
>
> > Doesn't unconditional readahead break the semantics of RWF_NOWAIT,
> > which is supposed to avoid waiting for I/O? For example, it might
> > trigger a pageout for a dirty page.
>
> Yes, but only for *some filesystems* in *some configurations*.
> Readahead allocation behaviour is specifically controlled by the gfp
> mask set on the mapping by the filesystem at inode instantiation
> time. i.e. via a call to mapping_set_gfp_mask().
>
> XFS, for one, always clears __GFP_FS from this mask, and several
> other filesystems set it to GFP_NOFS. Filesystems that do this will
> not do pageout for a dirty page during memory allocation.
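
For readers following the thread, the pattern being described is roughly
the following (a minimal sketch with a placeholder function name, not the
actual XFS code):

static void example_setup_mapping_gfp(struct inode *inode)
{
	/*
	 * Clearing __GFP_FS at inode instantiation time makes every
	 * page-cache allocation for this mapping - readahead included -
	 * behave as GFP_NOFS, so direct reclaim never recurses into the
	 * filesystem and never does pageout of dirty file pages.
	 */
	gfp_t gfp_mask = mapping_gfp_mask(inode->i_mapping);

	mapping_set_gfp_mask(inode->i_mapping, gfp_mask & ~__GFP_FS);
}
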
>
> Further, memory reclaim cannot write dirty pages to a filesystem
> without a ->writepage implementation. ->writepage is almost
> completely gone - neither ext4, btrfs nor XFS has a ->writepage
> implementation anymore - with f2fs being the only "major" filesystem
> still carrying one.
>
> IOWs, for most readahead cases right now, direct memory reclaim will
> not issue writeback IO on dirty cached file pages and in the near
> future that will change to -never-.
>
> That means the only IO that direct reclaim will be able to do is for
> swapping and compaction. Both of these can be prevented simply by
> setting a GFP_NOIO allocation context. IOWs, in the not-too-distant
> future we won't have to turn direct reclaim off to prevent IO from,
> and blocking in, direct reclaim during readahead - GFP_NOIO context
> will be all that is necessary for IOCB_NOWAIT readahead.
>
> That's why I'm asking if just doing readahead as it stands from
> RWF_NOWAIT causes any obvious problems. I think we really only
> need GFP_NOIO | __GFP_NORETRY allocation context for NOWAIT
> readahead IO, and that's something we already have a context API
> for.

Understood, thanks for your explanation.
So we need the changes below:

@@ -2526,8 +2528,14 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
        if (!folio_batch_count(fbatch)) {
+               unsigned int flags;
+
                if (iocb->ki_flags & IOCB_NOIO)
                        return -EAGAIN;
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       flags = memalloc_noio_save();
                page_cache_sync_readahead(mapping, ra, filp, index,
                                last_index - index);
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       memalloc_noio_restore(flags);
                filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
        }
        if (!folio_batch_count(fbatch)) {

What data would you recommend collecting after implementing the above
change? Should we measure the latency of preadv2(2) under high memory
pressure? Although latency can vary, it seems we have no choice but to
use memalloc_noio_save instead of memalloc_nowait_save, as the MM
folks are not in favor of the latter.
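
If it helps, a rough userspace sketch of the measurement is below (the file
path and read size are placeholders, and in practice we would loop over many
offsets under memory pressure and collect a latency histogram rather than
timing a single call):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;			/* 1MiB read, arbitrary */
	void *buf = malloc(len);
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct timespec t0, t1;
	ssize_t ret;
	long ns;
	int fd;

	if (!buf)
		return 1;

	fd = open("/mnt/test/datafile", O_RDONLY);	/* placeholder path */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);	/* may return -EAGAIN */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
	     (t1.tv_nsec - t0.tv_nsec);
	printf("preadv2(RWF_NOWAIT): ret=%zd (%s), latency=%ld ns\n",
	       ret, ret < 0 ? strerror(errno) : "ok", ns);

	close(fd);
	free(buf);
	return 0;
}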

--
Regards
Yafang




