Re: [PATCH] fs/writeback: Attach inode's wb to root if needed

Dennis Zhou <dennis@xxxxxxxxxx> · Mon, 13 May 2019 14:30:53 -0400

Hi Liguang,

On Fri, May 10, 2019 at 09:54:27AM +0800, 乱石 wrote:
> Hi Tejun,
> 
> 在 2019/5/10 0:48, Tejun Heo 写道:
> > Hi Tejun,
> > 
> > On Thu, May 09, 2019 at 04:03:53PM +0800, zhangliguang wrote:
> > > There might have tons of files queued in the writeback, awaiting for
> > > writing back. Unfortunately, the writeback's cgroup has been dead. In
> > > this case, we reassociate the inode with another writeback cgroup, but
> > > we possibly can't because the writeback associated with the dead cgroup
> > > is the only valid one. In this case, the new writeback is allocated,
> > > initialized and associated with the inode. It causes unnecessary high
> > > system load and latency.
> > > 
> > > This fixes the issue by enforce moving the inode to root cgroup when the
> > > previous binding cgroup becomes dead. With it, no more unnecessary
> > > writebacks are created, populated and the system load decreased by about
> > > 6x in the online service we encounted:
> > >      Without the patch: about 30% system load
> > >      With the patch:    about  5% system load
> > Can you please describe the scenario with more details?  I'm having a
> > bit of hard time understanding the amount of cpu cycles being
> > consumed.
> > 
> > Thanks.
> 
> Our search line reported a problem, when containerA was removed,
> containerB and containerC's system load were up to 30%.
> 
> We record the trace with 'perf record cycles:k -g -a', found that wb_init
> was the hotspot function.
> 
> Function call:
> 
> generic_file_direct_write
>    filemap_write_and_wait_range
>       __filemap_fdatawrite_range
>          wbc_attach_fdatawrite_inode
>             inode_attach_wb
>                __inode_attach_wb
>                   wb_get_create
>             wbc_attach_and_unlock_inode
>                if (unlikely(wb_dying(wbc->wb)))
>                   inode_switch_wbs
>                      wb_get_create
>                         ; Search bdi->cgwb_tree from memcg_css->id
>                         ; OR cgwb_create
>                            kmalloc
>                            wb_init       // hot spot
>                            ; Insert to bdi->cgwb_tree, mmecg_css->id as key
> 
> We discussed it through, base on the analysis:  When we running into the
> issue, there is cgroups are being deleted. The inodes (files) that were
> associated with these cgroups have to switch into another newly created
> writeback. We think there are huge amount of inodes in the writeback list
> that time. So we don't think there is anything abnormal. However, one
> thing we possibly can do: enforce these inodes to BDI embedded wirteback
> and we needn't create huge amount of writebacks in that case, to avoid
> the high system load phenomenon. We expect correct wb (best candidate) is
> picked up in next round.
> 
> Thanks,
> Liguang
> 
> > 
> 

If I understand correctly, this is mostlikely caused by a file shared by
cgroup A and cgroup B. This means cgroup B is doing direct io against
the file owned by the dying cgroup A. In this case, the code tries to do
a wb switch. However, it fails to reallocate the wb as it's deleted and
for the original cgrouip A's memcg id.

I think the below may be a better solution. Could you please test it? If
it works, I'll spin a patch with a more involved description.

Thanks,
Dennis

---

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..fb331ea2a626 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,7 +577,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg.  Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
+	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
 		inode_switch_wbs(inode, wbc->wb_id);
 }
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..685563ed9788 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -659,7 +659,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 
 	might_sleep_if(gfpflags_allow_blocking(gfp));
 
-	if (!memcg_css->parent)
+	if (!memcg_css->parent || css_is_dying(memcg_css))
 		return &bdi->wb;
 
 	do {