On Tue, May 25, 2021 at 12:54:35PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> > On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > > Just wondering why the ioend isn't submitted once it becomes full?
> > > > 
> > > > block layer plugging? Although failing bio allocations will kick that,
> > > > so that is not a deadlock risk.
> > > 
> > > These ioends are just added to one list stored in a local stack
> > > variable (submit_list); how can block layer plugging observe & submit
> > > them?
> > 
> > We ignore that, as the commit history says:
> > 
> > commit e10de3723c53378e7cf441529f563c316fdc0dd3
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Mon Feb 15 17:23:12 2016 +1100
> > 
> >     xfs: don't chain ioends during writepage submission
> > 
> >     Currently we can build a long ioend chain during ->writepages that
> >     gets attached to the writepage context. IO submission only then
> >     occurs when we finish all the writepage processing. This means we
> >     can have many ioends allocated and pending, and this violates the
> >     mempool guarantees that we need to give about forwards progress.
> >     i.e. we really should only have one ioend being built at a time,
> >     otherwise we may drain the mempool trying to allocate a new ioend
> >     and that blocks submission, completion and freeing of ioends that
> >     are already in progress.
> > 
> >     To prevent this situation from happening, we need to submit ioends
> >     for IO as soon as they are ready for dispatch rather than queuing
> >     them for later submission. This means the ioends have bios built
> >     immediately and they get queued on any plug that is current active.
> >     Hence if we schedule away from writeback, the ioends that have been
> >     built will make forwards progress due to the plug flushing on
> >     context switch. This will also prevent context switches from
> >     creating unnecessary IO submission latency.
> > 
> >     We can't completely avoid having nested IO allocation - when we have
> >     a block size smaller than a page size, we still need to hold the
> >     ioend submission until after we have marked the current page dirty.
> >     Hence we may need multiple ioends to be held while the current page
> >     is completely mapped and made ready for IO dispatch. We cannot avoid
> >     this problem - the current code already has this ioend chaining
> >     within a page so we can mostly ignore that it occurs.
> > 
> >     Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >     Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> >     Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > 
> > IOWs, this nesting for block size < page size has been out there
> > for many years now and we've yet to have anyone report that
> > writeback deadlocks have occurred.
> > 
> > There's a mistake in that commit message - we can't submit the
> > ioends on a page until we've marked the page as under writeback, not
> > dirty. That's because we cannot have ioends completing on a page
> > that isn't under writeback: calling end_page_writeback() on it when
> > it isn't under writeback will BUG(). Hence we have to build all the
> > submission state before we mark the page as under writeback and
> > perform the submission(s), to avoid completion racing with
> > submission.
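
(Side note, just to confirm I'm reading this ordering requirement
correctly -- below is only a rough sketch of the shape I understand,
with made-up names (writepage_submit_sketch() and the exact fields are
illustrative); it is not the actual iomap/XFS code:)

/*
 * Sketch only: build all ioends/bios for the page first, mark the page
 * under writeback, and only then submit, so that bio completion can
 * never call end_page_writeback() on a page that isn't under writeback.
 */
static void writepage_submit_sketch(struct page *page,
		struct list_head *submit_list)
{
	struct iomap_ioend *ioend, *next;

	/* 1) map the page and build ioends/bios onto submit_list */

	/* 2) mark the page before any of those bios can possibly complete */
	set_page_writeback(page);
	unlock_page(page);

	/* 3) now submission cannot race with end_page_writeback() */
	list_for_each_entry_safe(ioend, next, submit_list, io_list) {
		list_del_init(&ioend->io_list);
		submit_bio(ioend->io_bio);
	}
}
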
> > 
> > Hence we can't actually avoid nesting ioend allocation here within a
> > single page - the constraints of page cache writeback require it.
> > Hence the construction of the iomap_ioend_bioset uses a pool size of:
> > 
> > 	4 * (PAGE_SIZE / SECTOR_SIZE)
> > 
> > So that we always have enough ioends cached in the mempool to
> > guarantee forwards progress of writeback of any single page under
> > writeback.
> 
> OK, it looks like this is just for sub-page IO, so there isn't such an
> issue in the case of multiple ioends.

Thinking about it further, this way is still fragile: suppose there are
8 contexts in which writeback is running at the same time, and each
needs to allocate 6 ioends; then none of the contexts may be able to
move on when allocating its 6th ioend.

> 
> > But that is a completely separate problem to this:
> > 
> > > Chained bios have been submitted already, but all can't be completed/freed
> > > until the whole ioend is done, so that submission won't make forward progress.
> > 
> > This is a problem caused by having unbound contiguous ioend sizes,
> > not a problem caused by chaining bios. We can throw 256 pages into
> > a bio, so when we are doing huge contiguous IOs, we can map a
> > lot of sequential dirty pages into a contiguous extent into a very
> > long bio chain. But if we cap the max ioend size to, say, 4096
> > pages, then we've effectively capped the number of bios that can be
> > nested by such a writeback chain.
> 
> If the 4096 pages are not contiguous, there may be 4096/256 = 16 bios
> allocated for a single ioend: one is allocated from iomap_ioend_bioset,
> and the other 15 are allocated by bio_alloc() from fs_bio_set, which
> only reserves 2 bios.
> 
> > I was about to point you at the patchset that fixes this, but you've
> > already found it and are claiming that this nesting is an unfixable
> > problem. Capping the size of the ioend also bounds the depth of the
> > allocation nesting that will occur, and that fixes the whole nesting
> > deadlock problem: if the mempool reserves are deeper than the
> > maximum chain nesting that can occur, then there is no deadlock.
> > 
> > However, this points out what the real problem here is: that bio
> > allocation is designed to deadlock when nesting bio allocation rather
> > than fail. Hence at the iomap level we've got no way of knowing that
> > we should terminate and submit the current bio chain and start a new
> > ioend+bio chain, because we will hang before there's any indication
> > that a deadlock could occur.
> 
> Most reservations are small, fs_bio_set included, so bio_alloc_bioset()
> documents that callers should 'never allocate more than 1 bio at a
> time'. Actually iomap_chain_bio() does allocate the new bio first and
> then submits the old one, and fs_bio_set reserves two bios, so that
> order (alloc before submit) is fine.

It may not be fine, though, when more than one context is involved in
writeback; a rough sketch of the worst case I have in mind is appended
after my signature.

Thanks,
Ming
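
A rough sketch of the chaining pattern and the arithmetic above (the
helper name and comments here are only for illustration -- the real
code is iomap_chain_bio(); I'm also assuming 256 pages per bio and the
4096-page ioend cap discussed above):

/*
 * Allocate the new bio from fs_bio_set *before* submitting the old one.
 * For a single submitter the two reserved bios are enough, but a fully
 * discontiguous 4096-page ioend still needs 4096 / 256 = 16 bios: one
 * from iomap_ioend_bioset plus 15 chained ones from fs_bio_set, so
 * several writeback contexts doing this at once can all end up blocked
 * in bio_alloc() waiting on the same small reserve.
 */
static struct bio *chain_bio_sketch(struct bio *prev)
{
	struct bio *new;

	/* may sleep on the fs_bio_set mempool if the reserve is exhausted */
	new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
	new->bi_iter.bi_sector = bio_end_sector(prev);
	new->bi_opf = prev->bi_opf;

	bio_chain(prev, new);	/* new won't complete before prev does */
	submit_bio(prev);	/* the old bio goes out only after the new alloc */
	return new;
}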