On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> [BACKGROUND]
> The function filemap_add_folio() charges the memory cgroup,
> as we assume all page cache is accessible by user space processes
> and thus needs the cgroup accounting.
>
> However btrfs is a special case: it has very large metadata thanks to
> its support of data csums (by default 4 bytes per 4K of data, and up
> to 32 bytes per 4K of data).
> This means btrfs has to go through the page cache for its metadata
> pages, to take advantage of both the caching and the reclaim ability
> of the filemap.
>
> This has a small problem: all btrfs metadata pages have to go through
> the memcg charge, even though those metadata pages are not accessible
> by user space, and the charging can introduce extra latency if a
> memory limit is set.
>
> Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
> cgroup charge situation, so that metadata pages won't really be
> limited by the memcg.
>
> [ENHANCEMENT]
> Instead of relying on __GFP_NOFAIL to avoid charge failure, use the
> root memory cgroup to attach metadata pages.
>
> With the root memory cgroup, we skip the charging part entirely, and
> only rely on __GFP_NOFAIL for the real memory allocation part.
>
> Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> Suggested-by: Vlastimil Babka (SUSE) <vbabka@xxxxxxxxxx>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
>  fs/btrfs/extent_io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index aa7f8148cd0d..cfeed7673009 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>
>  	struct btrfs_fs_info *fs_info = eb->fs_info;
>  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> +	struct mem_cgroup *old_memcg;
>  	const unsigned long index = eb->start >> PAGE_SHIFT;
>  	struct folio *existing_folio = NULL;
>  	int ret;
> @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>  	ASSERT(eb->folios[i]);
>
>  retry:
> +	/*
> +	 * Btree inode is a btrfs internal inode, and not exposed to any
> +	 * user.
> +	 * Furthermore we do not want any cgroup limits on this inode.
> +	 * So we always use root_mem_cgroup as our active memcg when attaching
> +	 * the folios.
> +	 */
> +	old_memcg = set_active_memcg(root_mem_cgroup);
>  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
>  				GFP_NOFS | __GFP_NOFAIL);
> +	set_active_memcg(old_memcg);

It looks correct. But it goes through the whole dance of setting up
current->active_memcg, having the charge path look that up, css_get(),
calling try_charge() only to bail immediately, css_put(), and then
updating current->active_memcg again.

All those branches are necessary when we want to charge to a "real"
other cgroup. But in this case we always know we're not charging, so
it seems uncalled for.

Wouldn't it be a lot simpler (and cheaper) to have a
filemap_add_folio_nocharge()?
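
To make that concrete, here is a rough, untested sketch of what such a
helper could look like if it lived in mm/filemap.c next to
filemap_add_folio() (the name is just the one proposed above, and the
body assumes it can reuse __filemap_add_folio(), workingset_refault()
and folio_add_lru() the same way filemap_add_folio() does, with that
file's existing includes):

int filemap_add_folio_nocharge(struct address_space *mapping,
			       struct folio *folio, pgoff_t index, gfp_t gfp)
{
	void *shadow = NULL;
	int ret;

	/* Same as filemap_add_folio(), minus the memcg charge/uncharge. */
	__folio_set_locked(folio);
	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
	if (ret) {
		__folio_clear_locked(folio);
		return ret;
	}

	/*
	 * Mirror filemap_add_folio(): a folio that was only recently
	 * evicted gets refault-activated, then goes onto the LRU so
	 * reclaim can still see it.
	 */
	if (!(gfp & __GFP_WRITE) && shadow)
		workingset_refault(folio, shadow);
	folio_add_lru(folio);
	return 0;
}

The btrfs side would then collapse to a single call with no
active_memcg juggling:

	ret = filemap_add_folio_nocharge(mapping, eb->folios[i], index + i,
					 GFP_NOFS | __GFP_NOFAIL);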