Re: iomap: writeback ioend/bio allocation deadlock risk

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 25 May 2021 12:54:35 +0800

On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > Just wondering why the ioend isn't submitted out after it becomes full?
> > > 
> > > block layer plugging?  Although failing bio allocations will kick that,
> > > so that is not a deadlock risk.
> > 
> > These ioends are just added to one list stored on local stack variable(submit_list),
> > how can block layer plugging observe & submit them out?
> 
> We ignore that, as the commit histoy says:
> 
> commit e10de3723c53378e7cf441529f563c316fdc0dd3
> Author: Dave Chinner <dchinner@xxxxxxxxxx>
> Date:   Mon Feb 15 17:23:12 2016 +1100
> 
>     xfs: don't chain ioends during writepage submission
> 
>     Currently we can build a long ioend chain during ->writepages that
>     gets attached to the writepage context. IO submission only then
>     occurs when we finish all the writepage processing. This means we
>     can have many ioends allocated and pending, and this violates the
>     mempool guarantees that we need to give about forwards progress.
>     i.e. we really should only have one ioend being built at a time,
>     otherwise we may drain the mempool trying to allocate a new ioend
>     and that blocks submission, completion and freeing of ioends that
>     are already in progress.
> 
>     To prevent this situation from happening, we need to submit ioends
>     for IO as soon as they are ready for dispatch rather than queuing
>     them for later submission. This means the ioends have bios built
>     immediately and they get queued on any plug that is current active.
>     Hence if we schedule away from writeback, the ioends that have been
>     built will make forwards progress due to the plug flushing on
>     context switch. This will also prevent context switches from
>     creating unnecessary IO submission latency.
> 
>     We can't completely avoid having nested IO allocation - when we have
>     a block size smaller than a page size, we still need to hold the
>     ioend submission until after we have marked the current page dirty.
>     Hence we may need multiple ioends to be held while the current page
>     is completely mapped and made ready for IO dispatch. We cannot avoid
>     this problem - the current code already has this ioend chaining
>     within a page so we can mostly ignore that it occurs.
> 
>     Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>     Reviewed-by: Christoph Hellwig <hch@xxxxxx>
>     Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> 
> IOWs, this nesting for block size < page size has been out there
> for many years now and we've yet to have anyone report that
> writeback deadlocks have occurred.
> 
> There's a mistake in that commit message - we can't submit the
> ioends on a page until we've marked the page as under writeback, not
> dirty. That's because we cannot have ioends completing on a a page
> that isn't under writeback because calling end_page_writeback() on
> it when it isn't under writeback will BUG(). Hence we have to build
> all the submission state before we mark the page as under writeback
> and perform the submission(s) to avoid completion racing with
> submission.
> 
> Hence we can't actually avoid nesting ioend allocation here within a
> single page - the constraints of page cache writeback require it.
> Hence the construction of the iomap_ioend_bioset uses a pool size of:
> 
> 	4 * (PAGE_SIZE / SECTOR_SIZE)
> 
> So that we always have enough ioends cached in the mempool to
> guarantee forwards progress of writeback of any single page under
> writeback.

OK, looks it is just for subpage IO, so there isn't such issue
in case of multiple ioends.

> 
> But that is a completely separate problem to this:
> 
> > Chained bios have been submitted already, but all can't be completed/freed
> > until the whole ioend is done, that submission won't make forward progress.
> 
> This is a problem caused by having unbound contiguous ioend sizes,
> not a problem caused by chaining bios. We can throw 256 pages into
> a bio, so when we are doing huge contiguous IOs, we can map a
> lot of sequential dirty pages into a contiguous extent into a very
> long bio chain. But if we cap the max ioend size to, say, 4096
> pages, then we've effectively capped the number of bios that can be
> nested by such a writeback chain.

If the 4096 pages are not continuous, there may be 4096/256=16 bios
allocated for single ioend, and one is allocated from iomap_ioend_bioset,
another 15 is allocated by bio_alloc() from fs_bio_set which just
reserves 2 bios.

> 
> I was about to point you at the patchset that fixes this, but you've
> already found it and are claiming that this nesting is an unfixable
> problem. Capping the size of the ioend also bounds the depth of the
> allocation nesting that will occur, and that fixes the whole nseting
> deadlock problem: If the mempool reserves are deeper than than the
> maximum chain nesting that can occur, then there is no deadlock.
> 
> However, this points out what the real problem here is: that bio
> allocation is designed to deadlock when nesting bio allocation rather
> than fail. Hence at the iomap level we've go no way of knowing that
> we should terminate and submit the current bio chain and start a new
> ioend+bio chain because we will hang before there's any indication
> that a deadlock could occur.

Most of reservation is small, such as fs_bio_set, so bio_alloc_bioset()
documents that 'never allocate more than 1 bio at a time'. Actually
iomap_chain_bio() does allocate a new one, then submits the old one.
'fs_bio_set' reserves two bios, so the order(alloc before submit) is fine,
but all bios allocated from iomap_chain_bio() can only be freed after the
whole ioend is done, this way is fragile to deadlock, because submission
can't provide forward progress an more, such as, flushing plug list before
schedule out can't work as expected.

One question is why the chained bios in ioend aren't completed individually?
What is the advantage to complete all bios in iomap_finish_ioend()?

> 
> And then the elephant in the room: reality.
> 
> We've been nesting bio allocations via this chaining in production
> systems under heavy memory pressure for 5 years now and we don't
> have a single bug report indicating that this code deadlocks. So
> while there's a theoretical problem, evidence points to it not being
> an issue in practice.
> 
> Hence I don't think there is any need to urgently turn this code
> upside down. I'd much prefer that bio allocation would fail rather

GFP_NOWAIT can be passed to bio_alloc() if you like, but the caller has
to handle out of allocation. So far iomap ioend code supposes all bio
allocation can succeed.

Thanks, 
Ming