On Fri, 4 Oct 2019 15:11:04 -0700 Roman Gushchin wrote:
>
> This is an RFC patch, which is not intended to be merged as is,
> but hopefully will start a discussion that can result in a good
> solution for the described problem.
> --
> We've noticed that the number of dying cgroups on our production hosts
> tends to grow with the uptime. This time it's caused by the writeback
> code.
>
> An inode which is getting dirty for the first time is associated
> with the wb structure (look at __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but generally
> stays associated up to the end of life of the inode structure.
>
> The problem is that the wb structure holds a reference to the original
> memory cgroup. So if the inode was dirty once, it has a good chance
> to pin down the original memory cgroup.
>
> An example from real life: some service runs periodically and
> updates rpm packages, each time in a new memory cgroup. Installed
> .so files are heavily used by other cgroups, so the corresponding
> inodes tend to stay alive for a long time. So do the pinned memory
> cgroups. In production I've seen many hosts with 1-2 thousand
> dying cgroups.

The diff below reworks e8a7abf5a5bd ("writeback: disassociate inodes
from dying bdi_writebacks") by selecting a new memcg_css id for the
dying bdi_writeback to switch to. A check for an offline memcg is also
added, which is perhaps what your case needs. Let us know if it makes
sense and helps cut the dying cgroups down a bit.

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -552,6 +552,8 @@ out_free:
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	int new_id = 0;
+
 	if (!inode_cgwb_enabled(inode)) {
 		spin_unlock(&inode->i_lock);
 		return;
@@ -560,6 +562,22 @@ void wbc_attach_and_unlock_inode(struct
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
+	if (unlikely(wb_dying(wbc->wb)) ||
+	    !mem_cgroup_from_css(wbc->wb->memcg_css)->cgwb_list.next) {
+		int id = wbc->wb->memcg_css->id;
+		/*
+		 * Any css id will do here; switching away lets the
+		 * dying/offline memcg be reaped.
+		 */
+		if (id != wbc->wb_id && wbc->wb_id)
+			new_id = wbc->wb_id;
+		else if (id != wbc->wb_lcand_id && wbc->wb_lcand_id)
+			new_id = wbc->wb_lcand_id;
+		else if (id != wbc->wb_tcand_id && wbc->wb_tcand_id)
+			new_id = wbc->wb_tcand_id;
+		else
+			new_id = inode_to_bdi(inode)->wb.memcg_css->id;
+	}
 	wbc->wb_id = wbc->wb->memcg_css->id;
 	wbc->wb_lcand_id = inode->i_wb_frn_winner;
 	wbc->wb_tcand_id = 0;
@@ -574,8 +592,8 @@ void wbc_attach_and_unlock_inode(struct
 	/*
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg. Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
-		inode_switch_wbs(inode, wbc->wb_id);
+	if (new_id)
+		inode_switch_wbs(inode, new_id);
 }
 EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
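
For readers skimming the diff, the id selection boils down to the
helper below. This is only an illustrative distillation of the hunk
above, not code from the patch; pick_switch_target() is a hypothetical
name, and it assumes the wbc candidate fields (wb_id, wb_lcand_id,
wb_tcand_id) already carry meaningful values at the point of the check:

	/*
	 * Prefer any candidate css id recorded in the wbc that differs
	 * from the dying wb's own id; fall back to the bdi's embedded
	 * root wb, whose memcg_css (the root memcg) never dies.
	 */
	static int pick_switch_target(struct writeback_control *wbc,
				      struct inode *inode)
	{
		int id = wbc->wb->memcg_css->id;

		if (wbc->wb_id && wbc->wb_id != id)
			return wbc->wb_id;
		if (wbc->wb_lcand_id && wbc->wb_lcand_id != id)
			return wbc->wb_lcand_id;
		if (wbc->wb_tcand_id && wbc->wb_tcand_id != id)
			return wbc->wb_tcand_id;
		return inode_to_bdi(inode)->wb.memcg_css->id;
	}

The unconditional fallback to the root wb is what guarantees a dying
wb always has somewhere to switch to.
--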