Re: [PATCH 0/2] mm: skip memcg for certain address space

Qu Wenruo <wqu@xxxxxxxx> · Thu, 18 Jul 2024 18:22:11 +0930

On 2024/7/18 17:39, Michal Hocko wrote:
On Thu 18-07-24 17:27:05, Qu Wenruo wrote:

On 2024/7/18 16:55, Michal Hocko wrote:
On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
On 7/18/24 12:38 AM, Qu Wenruo wrote:
[...]
Does the folio order have anything to do with the problem, or does a
higher order just make it more likely?

I didn't spot anything in the memcg charge path that would depend on the
order directly, hm. Also what kernel version was showing these soft lockups?

Correct. Order just defines the number of charges to be reclaimed.
Unlike the page allocator path we do not have any specific requirements
on the memory to be released.

So I guess the higher folio order just brings more pressure to trigger the
problem?

It increases the reclaim target (in number of pages to reclaim). That
might contribute but we are cond_resched-ing in shrink_node_memcgs and
also down the path in shrink_lruvec etc. So higher target shouldn't
cause soft lockups unless we have a bug there - e.g. not triggering any
of those paths with empty LRUs and looping somewhere. Not sure about
MGLRU state of things TBH.
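
As a rough illustration of "order just defines the number of charges": an
order-n folio covers 2^n base pages, and that page count is what the charge
(and therefore the reclaim target) scales with. The helper name below is
hypothetical and this is plain user-space C, not code from the memcg charge
path.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): number of base pages
 * covered by a folio of the given order. An order-n folio spans 2^n pages.
 */
unsigned long folio_order_to_pages(unsigned int order)
{
    /* Guard against shifting past the width of unsigned long. */
    assert(order < sizeof(unsigned long) * 8);
    return 1UL << order;
}
```

So going from order 0 to order 2 only raises the number of pages to charge
from 1 to 4; it does not change which paths are taken.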

And finally, even without the hang problem, does it make any sense to
skip all possible memcg charges completely, either to reduce latency
or just to reduce GFP_NOFAIL usage, for those user-inaccessible inodes?

Let me just add to the pile of questions. Who owns this memory?

A special inode inside btrfs, which we call btree_inode. It is not
accessible outside the btrfs module, and its lifespan is the same as that
of the mounted btrfs filesystem.

But the memory charge is attributed to the caller unless you tell
otherwise.

By the caller, do you mean the user space program that triggered the
filesystem operations?

Then it's too hard to determine. Almost all btrfs operations involve
its metadata, from basic read/write to some endio functions
(deferred to a workqueue, such as verifying the data against its csum).

So if this is really an internal use and you use a shared
infrastructure which expects the current task to be the owner of the
charged memory, then you need to wrap the initialization in a
set_active_memcg scope.

And for the root cgroup, I guess that means there are no memory limits,
so filemap_add_folio() should always succeed (except for real -ENOMEM
situations, or -EEXIST errors that btrfs would handle)?

Then it looks like a good solution, at least from the perspective of btrfs.
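
To make sure I understand the suggested scope, here is a minimal user-space
mock of the pattern. It is only a sketch: set_active_memcg() in the kernel
does return the previous value so the caller can restore it, but the struct
layout, the mock_* helpers and the two memcg objects below are stand-ins
made up for illustration, not btrfs or mm code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mem_cgroup. */
struct mem_cgroup {
    const char *name;
    long charged_pages;
};

/* Hypothetical objects: the root memcg and the memcg of the calling task. */
struct mem_cgroup root_memcg = { "root", 0 };
struct mem_cgroup task_memcg = { "user-task", 0 };

/* Mock of the per-task override; NULL means "charge the current task". */
struct mem_cgroup *active_memcg = NULL;

/* Same shape as the kernel helper: install an override, return the old one. */
struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
{
    struct mem_cgroup *old = active_memcg;

    active_memcg = memcg;
    return old;
}

/* Hypothetical charge: use the override if set, else the task's memcg. */
struct mem_cgroup *mock_charge(long nr_pages)
{
    struct mem_cgroup *target = active_memcg ? active_memcg : &task_memcg;

    target->charged_pages += nr_pages;
    return target;
}

/*
 * Hypothetical stand-in for adding a metadata folio to the btree inode:
 * the charge inside the scope goes to the root memcg, and the previous
 * override is restored before returning.
 */
struct mem_cgroup *mock_add_metadata_folio(long nr_pages)
{
    struct mem_cgroup *old = set_active_memcg(&root_memcg);
    struct mem_cgroup *charged = mock_charge(nr_pages);

    set_active_memcg(old);
    assert(active_memcg == old);
    return charged;
}
```

In this mock, a plain mock_charge() lands on the task's memcg, while
mock_add_metadata_folio() lands on the root one regardless of which task
triggered it, which is the behavior I would want for btree_inode.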

Thanks,
Qu