Re: [5.4-rc1, regression] wb_workfn wakeup oops (was Re: frequent 5.4-rc1 crash?)

Hi Chris,

On Thu, Oct 03, 2019 at 02:01:13PM +0000, Chris Mason wrote:
> 
> 
> On 3 Oct 2019, at 4:41, Gao Xiang wrote:
> 
> > Hi,
> >
> > On Thu, Oct 03, 2019 at 04:40:22PM +1000, Dave Chinner wrote:
> >> [cc linux-fsdevel, linux-block, tejun ]
> >>
> >> On Wed, Oct 02, 2019 at 06:52:47PM -0700, Darrick J. Wong wrote:
> >>> Hi everyone,
> >>>
> >>> Does anyone /else/ see this crash in generic/299 on a V4 filesystem (tho
> >>> afaict V5 configs crash too) and a 5.4-rc1 kernel?  It seems to pop up
> >>> on generic/299 though only 80% of the time.
> >>>
> >
> > Just at a quick glance, I guess there could be a race here (complete guess):
> >
> >
> >  static void finish_writeback_work(struct bdi_writeback *wb,
> >                                    struct wb_writeback_work *work)
> >  {
> >          struct wb_completion *done = work->done;
> >
> >          if (work->auto_free)
> >                  kfree(work);
> >          if (done && atomic_dec_and_test(&done->cnt))
> >
> >  ^^^ here
> >
> >                  wake_up_all(done->waitq);
> >  }
> >
> > since the new wake_up_all(done->waitq) reads from the on-stack wb_completion,
> >  	if (done && atomic_dec_and_test(&done->cnt))
> > -		wake_up_all(&wb->bdi->wb_waitq);
> > +		wake_up_all(done->waitq);
> >  }
> >
> > which could cause a use-after-free if the on-stack wb_completion is already
> > gone... (whereas the previous wb->bdi was safe, since it is not on-stack)
> >
> > See the generic on-stack completion, which takes the wait_queue spinlock
> > between the test and the wake_up...
> >
> > If I am wrong, ignore me, hmm...
> 
> It's a good guess ;)  Jens should have this queued up already:
> 
> https://lkml.org/lkml/2019/9/23/972

Oh, I didn't notice that; it's great that it's already resolved. :)

It wasn't entirely a guess, though: at a very early stage last year we had
a similar pattern (an IO balance counter plus a wait_queue, since a
completion was too heavy), which was resolved in commit 848bd9acdcd0.
So I have some experience with such cases.
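For illustration, the pattern can be mimicked in userspace. This is only a sketch under stated assumptions, not the kernel code and not necessarily what the queued fix does: pthreads and C11 atomics stand in for wait_queue_head_t and atomic_t, and every name in it (struct waitq, struct completion_cnt, global_wq, finish_work(), run_and_wait()) is hypothetical. The counter lives in an on-stack object, like wb_completion, while the wait queue lives in a longer-lived one, like bdi->wb_waitq. What matters is the ordering: the waker copies the wait queue pointer out of the on-stack object before the atomic decrement, so once the count may have reached zero it never touches that object again.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical long-lived wait queue, standing in for bdi->wb_waitq. */
struct waitq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

/* Hypothetical short-lived completion, standing in for wb_completion. */
struct completion_cnt {
	atomic_int cnt;
	struct waitq *waitq;
};

static struct waitq global_wq = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.cond = PTHREAD_COND_INITIALIZER,
};

/* Hypothetical analogue of finish_writeback_work(): drop one count, wake on zero. */
static void finish_work(struct completion_cnt *done)
{
	/* Safe: our count is not dropped yet, so *done is still alive. */
	struct waitq *wq = done->waitq;

	/*
	 * After this decrement the waiter may return and its stack frame
	 * (holding *done) may be reused, so only wq is used below.
	 * The buggy variant reads done->waitq here, after the decrement.
	 */
	if (atomic_fetch_sub(&done->cnt, 1) == 1) {
		pthread_mutex_lock(&wq->lock);
		pthread_cond_broadcast(&wq->cond);
		pthread_mutex_unlock(&wq->lock);
	}
}

static void *worker(void *arg)
{
	finish_work(arg);
	return NULL;
}

/* Waiter: the completion lives on this function's stack. */
static int run_and_wait(int nworkers)
{
	struct completion_cnt done = { .waitq = &global_wq };

	assert(nworkers > 0);
	atomic_store(&done.cnt, nworkers);

	for (int i = 0; i < nworkers; i++) {
		pthread_t tid;

		if (pthread_create(&tid, NULL, worker, &done) != 0)
			abort();
		/* Detached: we wait only for cnt == 0, not for workers to exit. */
		pthread_detach(tid);
	}

	/* Check the count under the lock the wakers take, so no wakeup is lost. */
	pthread_mutex_lock(&global_wq.lock);
	while (atomic_load(&done.cnt) != 0)
		pthread_cond_wait(&global_wq.cond, &global_wq.lock);
	pthread_mutex_unlock(&global_wq.lock);

	/* Returning ends the lifetime of done while workers may still run. */
	return atomic_load(&done.cnt);
}
```

Build with -pthread; run_and_wait(4) returns 0 once all four workers have dropped their counts. Moving the done->waitq load below atomic_fetch_sub() brings back exactly the use-after-free discussed above, because the waiter may already have returned.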

I just read the mailing list regularly and hope to be of some help here...
Sorry about the noise...

Thanks,
Gao Xiang

> 
> -chris


