Re: Patch 'writeback, cgroup: release dying cgwbs by switching attached inodes' leads to kernel crash

On Fri, Jul 16, 2021 at 12:05 AM Roman Gushchin <guro@xxxxxx> wrote:
>
> On Thu, Jul 15, 2021 at 11:31:17AM +0200, Jan Kara wrote:
> > On Thu 15-07-21 09:42:06, Boyang Xue wrote:
> > > On Thu, Jul 15, 2021 at 7:46 AM Roman Gushchin <guro@xxxxxx> wrote:
> > > >
> > > > On Thu, Jul 15, 2021 at 12:22:28AM +0800, Boyang Xue wrote:
> > > > > Hi Jan,
> > > > >
> > > > > On Wed, Jul 14, 2021 at 5:26 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > >
> > > > > > On Wed 14-07-21 16:44:33, Boyang Xue wrote:
> > > > > > > Hi Roman,
> > > > > > >
> > > > > > > On Wed, Jul 14, 2021 at 12:12 PM Roman Gushchin <guro@xxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 14, 2021 at 11:21:12AM +0800, Boyang Xue wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I'm not sure if this is the right place to report this bug; please
> > > > > > > > > correct me if I'm wrong.
> > > > > > > > >
> > > > > > > > > I found that kernel-5.14.0-rc1 (built from the Linus tree) crashes when
> > > > > > > > > it's running xfstests generic/256 on ext4 [1]. Looking at the call trace,
> > > > > > > > > it looks like the bug was introduced by the commit
> > > > > > > > >
> > > > > > > > > c22d70a162d3 writeback, cgroup: release dying cgwbs by switching attached inodes
> > > > > > > > >
> > > > > > > > > It only happens on aarch64, not on x86_64, ppc64le, or s390x. Testing
> > > > > > > > > was performed with the latest xfstests, and the bug can be reproduced
> > > > > > > > > on ext{2, 3, 4} with {1k, 2k, 4k} block sizes.
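
For reproduction, a typical xfstests invocation for one of these
configurations looks something like the following (illustrative; this
assumes TEST_DEV/SCRATCH_DEV are already set in xfstests' local.config):

    FSTYP=ext4 MKFS_OPTIONS="-b 1024" ./check generic/256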
> > > > > > > >
> > > > > > > > Hello Boyang,
> > > > > > > >
> > > > > > > > thank you for the report!
> > > > > > > >
> > > > > > > > Do you know on which line the oops happens?
> > > > > > >
> > > > > > > I was trying to inspect the vmcore with the crash utility, but
> > > > > > > unfortunately it doesn't work.
> > > > > >
> > > > > > Thanks for the report!  Have you tried the addr2line utility? Looking at
> > > > > > the oops I can see:
> > > > >
> > > > > Thanks for the tips!
> > > > >
> > > > > It's unclear to me where to find the required address for the
> > > > > addr2line command line, i.e.
> > > > >
> > > > > addr2line -e /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux
> > > > > <what address here?>
> > > >
> > > > You can use "nm <vmlinux>" to get the address of cleanup_offline_cgwbs_workfn()
> > > > and then add 0x320 to it.
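
For reference, that lookup goes something like this (the vmlinux path is
taken from above; the symbol address is made up for illustration and will
differ on a real build):

    $ nm /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux | grep cleanup_offline_cgwbs_workfn
    ffff80001033c8e0 t cleanup_offline_cgwbs_workfn
    $ addr2line -e /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux ffff80001033cc00

where ffff80001033cc00 is the symbol address plus the 0x320 offset from the
oops.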
> > >
> > > Thanks! Hope the following helps:
> >
> > Thanks for the data!
> >
> > > static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > > {
> > >         struct bdi_writeback *wb;
> > >         LIST_HEAD(processed);
> > >
> > >         spin_lock_irq(&cgwb_lock);
> > >
> > >         while (!list_empty(&offline_cgwbs)) {
> > >                 wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > >                                       offline_node);
> > >                 list_move(&wb->offline_node, &processed);
> > >
> > >                 /*
> > >                  * If wb is dirty, cleaning up the writeback by switching
> > >                  * attached inodes will result in an effective removal of any
> > >                  * bandwidth restrictions, which isn't the goal.  Instead,
> > >                  * it can be postponed until the next time, when all io
> > >                  * will be likely completed.  If in the meantime some inodes
> > >                  * will get re-dirtied, they should be eventually switched to
> > >                  * a new cgwb.
> > >                  */
> > >                 if (wb_has_dirty_io(wb))
> > >                         continue;
> > >
> > >                 if (!wb_tryget(wb))  <=== line#679
> > >                         continue;
> >
> > Aha, interesting. So it seems we crashed trying to dereference
> > wb->refcnt->data. So it looks like cgwb_release_workfn() raced with
> > cleanup_offline_cgwbs_workfn() and percpu_ref_exit() got called from
> > cgwb_release_workfn() and then cleanup_offline_cgwbs_workfn() called
> > wb_tryget(). I think the proper fix is to move:
> >
> >         spin_lock_irq(&cgwb_lock);
> >         list_del(&wb->offline_node);
> >         spin_unlock_irq(&cgwb_lock);
> >
> > in cgwb_release_workfn() to the beginning of that function so that we are
> > sure even cleanup_offline_cgwbs_workfn() cannot be working with the wb when
> > it is being released. Roman?
>
> Yes, it sounds like the most reasonable explanation.
> Thank you!
>
> Boyang, would you mind testing the following patch?

No problem. I'm testing it. Thanks for the patch.
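
To make sure I understand the race being described, the interleaving is
roughly the following (wb_tryget() operates on wb->refcnt):

    cgwb_release_workfn()                 cleanup_offline_cgwbs_workfn()
    ---------------------                 ------------------------------
    percpu_ref_exit(&wb->refcnt);
                                          spin_lock_irq(&cgwb_lock);
                                          wb = list_first_entry(&offline_cgwbs, ...);
                                          wb_tryget(wb); /* refcnt already freed: crash */
    spin_lock_irq(&cgwb_lock);
    list_del(&wb->offline_node);
    spin_unlock_irq(&cgwb_lock);

i.e. the wb is still reachable through the offline_cgwbs list after its
percpu ref has been torn down.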

>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 271f2ca862c8..f5561ea7d90a 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -398,12 +398,12 @@ static void cgwb_release_workfn(struct work_struct *work)
>         blkcg_unpin_online(blkcg);
>
>         fprop_local_destroy_percpu(&wb->memcg_completions);
> -       percpu_ref_exit(&wb->refcnt);
>
>         spin_lock_irq(&cgwb_lock);
>         list_del(&wb->offline_node);
>         spin_unlock_irq(&cgwb_lock);
>
> +       percpu_ref_exit(&wb->refcnt);
>         wb_exit(wb);
>         WARN_ON_ONCE(!list_empty(&wb->b_attached));
>         kfree_rcu(wb, rcu);
>
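
For clarity, with the patch applied the tail of cgwb_release_workfn() should
read roughly like this (a sketch reconstructed from the diff above), so the
wb is unhooked from offline_cgwbs before its refcount is destroyed:

        fprop_local_destroy_percpu(&wb->memcg_completions);

        /* remove from offline_cgwbs before tearing down the percpu ref */
        spin_lock_irq(&cgwb_lock);
        list_del(&wb->offline_node);
        spin_unlock_irq(&cgwb_lock);

        percpu_ref_exit(&wb->refcnt);
        wb_exit(wb);
        WARN_ON_ONCE(!list_empty(&wb->b_attached));
        kfree_rcu(wb, rcu);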



