Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()

Roman Gushchin <roman.gushchin@xxxxxxxxx> · Fri, 19 Jul 2024 17:25:09 +0000

On Fri, Jul 19, 2024 at 01:02:06PM -0400, Johannes Weiner wrote:
> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> > [BACKGROUND]
> > The function filemap_add_folio() charges the memory cgroup,
> > as we assume all page cache is accessible by user space processes
> > and thus needs cgroup accounting.
> > 
> > However btrfs is a special case: it has very large metadata thanks to
> > its support of data csum (by default it's 4 bytes per 4K data, and can
> > be as large as 32 bytes per 4K data).
> > This means btrfs has to use the page cache for its metadata pages, to take
> > advantage of both the caching and the reclaim ability of filemap.
> > 
> > This has a tiny problem: all btrfs metadata pages have to go through
> > the memcgroup charge, even though those metadata pages are not
> > accessible by user space, and the charging can introduce some
> > latency if a memory limit is set.
> > 
> > Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this cgroup
> > charge situation so that metadata pages won't really be limited by
> > memcgroup.
> > 
> > [ENHANCEMENT]
> > Instead of relying on __GFP_NOFAIL to avoid charge failure, use the root
> > memory cgroup when attaching metadata pages.
> > 
> > With the root memory cgroup, we skip the charging part entirely, and only
> > rely on __GFP_NOFAIL for the real memory allocation part.
> > 
> > Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> > Suggested-by: Vlastimil Babka (SUSE) <vbabka@xxxxxxxxxx>
> > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> > ---
> >  fs/btrfs/extent_io.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index aa7f8148cd0d..cfeed7673009 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  
> >  	struct btrfs_fs_info *fs_info = eb->fs_info;
> >  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> > +	struct mem_cgroup *old_memcg;
> >  	const unsigned long index = eb->start >> PAGE_SHIFT;
> >  	struct folio *existing_folio = NULL;
> >  	int ret;
> > @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  	ASSERT(eb->folios[i]);
> >  
> >  retry:
> > +	/*
> > +	 * Btree inode is a btrfs internal inode, and not exposed to any
> > +	 * user.
> > +	 * Furthermore we do not want any cgroup limits on this inode.
> > +	 * So we always use root_mem_cgroup as our active memcg when attaching
> > +	 * the folios.
> > +	 */
> > +	old_memcg = set_active_memcg(root_mem_cgroup);
> >  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
> >  				GFP_NOFS | __GFP_NOFAIL);
> > +	set_active_memcg(old_memcg);
> 
> It looks correct. But it's going through the whole dance to set up
> current->active_memcg, then have the charge path look that up,
> css_get(), call try_charge() only to bail immediately, css_put(), then
> update current->active_memcg again. All those branches are necessary
> when we want to charge to a "real" other cgroup. But in this case, we
> always know we're not charging, so it seems uncalled for.
> 
> Wouldn't it be a lot simpler (and cheaper) to have a
> filemap_add_folio_nocharge()?

Time to restore GFP_NOACCOUNT? I think it might be useful for allocating objects
which are shared across the entire system and/or are unlikely to go away under
memory pressure.
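
For illustration only, the sequence Johannes describes (set the active memcg, look it up, take a reference, attempt a charge that bails for the root cgroup, drop the reference, restore the active memcg) can be modeled in user space. This is a rough sketch, not kernel code: every name ending in _model is a hypothetical stand-in invented here, the step counter only counts bookkeeping operations in this toy model, and add_folio_nocharge_model() merely represents the idea of the proposed filemap_add_folio_nocharge(), which does not exist in the patch under review.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a memory cgroup; not a kernel type. */
struct memcg_model {
	int is_root;   /* the root cgroup is never actually charged */
	long refs;     /* models the css_get()/css_put() reference count */
	long charged;  /* pages charged to this cgroup */
};

static struct memcg_model root_model = { .is_root = 1 };
static struct memcg_model *active_model; /* models current->active_memcg */
static long steps;                       /* bookkeeping operations performed */

/* Models set_active_memcg(): swap the override and return the old one. */
static struct memcg_model *set_active_model(struct memcg_model *m)
{
	struct memcg_model *old = active_model;

	active_model = m;
	steps++;
	return old;
}

/* Models the charge attempt: returns immediately for the root cgroup. */
static int try_charge_model(struct memcg_model *m, long nr_pages)
{
	assert(nr_pages > 0);
	steps++;
	if (m->is_root)
		return 0;
	m->charged += nr_pages;
	return 0;
}

/* Models the charging add path: look up, get, charge, put. */
static int add_folio_model(struct memcg_model *task_cg, long nr_pages)
{
	struct memcg_model *m = active_model ? active_model : task_cg;
	int ret;

	m->refs++;
	steps++;
	ret = try_charge_model(m, nr_pages);
	m->refs--;
	steps++;
	return ret;
}

/* Models a hypothetical no-charge variant: no memcg bookkeeping at all. */
static int add_folio_nocharge_model(void)
{
	return 0;
}

/* Approach in the patch: override with root, add, restore. */
static long steps_with_root_override(struct memcg_model *task_cg)
{
	struct memcg_model *old;

	steps = 0;
	old = set_active_model(&root_model);
	add_folio_model(task_cg, 1);
	set_active_model(old);
	return steps;
}

/* Approach suggested in review: skip the charge path entirely. */
static long steps_with_nocharge(void)
{
	steps = 0;
	add_folio_nocharge_model();
	return steps;
}
```

In this model both approaches leave the task's cgroup uncharged; the difference is only the number of bookkeeping operations performed per folio, which is the overhead the review comment is concerned with.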