Hi Liguang,

On Fri, May 10, 2019 at 09:54:27AM +0800, 乱石 wrote:
> Hi Tejun,
>
> On 2019/5/10 0:48, Tejun Heo wrote:
> > Hi Tejun,
> >
> > On Thu, May 09, 2019 at 04:03:53PM +0800, zhangliguang wrote:
> > > There might be tons of files queued in the writeback, awaiting
> > > writeback. Unfortunately, the writeback's cgroup has died. In
> > > this case, we reassociate the inode with another writeback cgroup, but
> > > we possibly can't because the writeback associated with the dead cgroup
> > > is the only valid one. In this case, a new writeback is allocated,
> > > initialized and associated with the inode. It causes unnecessarily high
> > > system load and latency.
> > >
> > > This fixes the issue by forcing the inode onto the root cgroup when the
> > > previous binding cgroup becomes dead. With it, no more unnecessary
> > > writebacks are created and populated, and the system load decreased by
> > > about 6x in the online service where we encountered it:
> > > Without the patch: about 30% system load
> > > With the patch:    about  5% system load
> >
> > Can you please describe the scenario with more details? I'm having a
> > bit of a hard time understanding the amount of cpu cycles being
> > consumed.
> >
> > Thanks.
>
> Our search line reported a problem: when containerA was removed,
> containerB's and containerC's system load went up to 30%.
>
> We recorded a trace with 'perf record cycles:k -g -a' and found that
> wb_init was the hotspot function.
>
> Function call:
>
> generic_file_direct_write
>   filemap_write_and_wait_range
>     __filemap_fdatawrite_range
>       wbc_attach_fdatawrite_inode
>         inode_attach_wb
>           __inode_attach_wb
>             wb_get_create
>       wbc_attach_and_unlock_inode
>         if (unlikely(wb_dying(wbc->wb)))
>           inode_switch_wbs
>             wb_get_create
>               ; Search bdi->cgwb_tree from memcg_css->id
>               ; OR cgwb_create
>                 kmalloc
>                 wb_init // hot spot
>                 ; Insert into bdi->cgwb_tree, memcg_css->id as key
>
> We discussed it through, based on the analysis: when we run into the
> issue, cgroups are being deleted. The inodes (files) that were
> associated with those cgroups have to switch to another newly created
> writeback. We think there is a huge number of inodes in the writeback
> list at that time, so we don't think anything abnormal is going on.
> However, one thing we could possibly do: force these inodes onto the
> BDI-embedded writeback, so that we needn't create a huge number of
> writebacks in that case, avoiding the high system load. We expect the
> correct wb (best candidate) to be picked up in the next round.
>
> Thanks,
> Liguang
>

If I understand correctly, this is most likely caused by a file shared by
cgroup A and cgroup B. This means cgroup B is doing direct io against the
file owned by the dying cgroup A. In this case, the code tries to do a wb
switch. However, it fails to reallocate the wb, as it has been deleted and
was keyed to the original cgroup A's memcg id.

I think the below may be a better solution. Could you please test it? If
it works, I'll spin a patch with a more involved description.

Thanks,
Dennis

---

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..fb331ea2a626 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,7 +577,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg. Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
+	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
 		inode_switch_wbs(inode, wbc->wb_id);
 }

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..685563ed9788 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -659,7 +659,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 	might_sleep_if(gfpflags_allow_blocking(gfp));

-	if (!memcg_css->parent)
+	if (!memcg_css->parent || css_is_dying(memcg_css))
 		return &bdi->wb;

 	do {
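For anyone following along: the decision the fs-writeback.c hunk encodes can be modelled in a few lines of user-space C. This is a toy sketch, not kernel code; the structs below are invented stand-ins, and only the helper names (wb_dying, css_is_dying) mirror the real kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for cgroup_subsys_state and bdi_writeback. */
struct memcg_css { bool dying; };
struct bdi_wb    { bool dying; struct memcg_css *memcg_css; };

static bool wb_dying(const struct bdi_wb *wb)         { return wb->dying; }
static bool css_is_dying(const struct memcg_css *css) { return css->dying; }

/*
 * Patched decision: switch the inode to a new wb only when the current
 * wb is dying but its memcg is still alive (the memcg-blkcg mapping
 * changed).  If the memcg itself is dying, switching would only create
 * a throwaway wb, which is the wb_init() hotspot seen in the perf trace.
 */
static bool should_switch_wbs(const struct bdi_wb *wb)
{
	return wb_dying(wb) && !css_is_dying(wb->memcg_css);
}
```

With the second hunk, wb_get_create() additionally falls back to the BDI-embedded wb for a dying memcg, so even a racing switch cannot allocate a new wb for it.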