On Tue, May 25, 2021 at 12:54:35PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> > On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > > Just wondering why the ioend isn't submitted once it becomes full?
> > > > 
> > > > block layer plugging? Although failing bio allocations will kick that,
> > > > so that is not a deadlock risk.
> > > 
> > > These ioends are just added to one list stored in a local stack
> > > variable (submit_list); how can block layer plugging observe & submit
> > > them?
> > 
> > We ignore that, as the commit history says:
> > 
> > commit e10de3723c53378e7cf441529f563c316fdc0dd3
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Mon Feb 15 17:23:12 2016 +1100
> > 
> >     xfs: don't chain ioends during writepage submission
> > 
> >     Currently we can build a long ioend chain during ->writepages that
> >     gets attached to the writepage context. IO submission only then
> >     occurs when we finish all the writepage processing. This means we
> >     can have many ioends allocated and pending, and this violates the
> >     mempool guarantees that we need to give about forwards progress.
> >     i.e. we really should only have one ioend being built at a time,
> >     otherwise we may drain the mempool trying to allocate a new ioend
> >     and that blocks submission, completion and freeing of ioends that
> >     are already in progress.
> > 
> >     To prevent this situation from happening, we need to submit ioends
> >     for IO as soon as they are ready for dispatch rather than queuing
> >     them for later submission. This means the ioends have bios built
> >     immediately and they get queued on any plug that is current active.
> >     Hence if we schedule away from writeback, the ioends that have been
> >     built will make forwards progress due to the plug flushing on
> >     context switch. This will also prevent context switches from
> >     creating unnecessary IO submission latency.
> > 
> >     We can't completely avoid having nested IO allocation - when we have
> >     a block size smaller than a page size, we still need to hold the
> >     ioend submission until after we have marked the current page dirty.
> >     Hence we may need multiple ioends to be held while the current page
> >     is completely mapped and made ready for IO dispatch. We cannot avoid
> >     this problem - the current code already has this ioend chaining
> >     within a page so we can mostly ignore that it occurs.
> > 
> >     Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >     Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> >     Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > 
> > IOWs, this nesting for block size < page size has been out there
> > for many years now and we've yet to have anyone report that
> > writeback deadlocks have occurred.
> > 
> > There's a mistake in that commit message - we can't submit the
> > ioends on a page until we've marked the page as under writeback, not
> > dirty. That's because we cannot have ioends completing on a page
> > that isn't under writeback: calling end_page_writeback() on it when
> > it isn't under writeback will BUG(). Hence we have to build all the
> > submission state before we mark the page as under writeback and
> > perform the submission(s), to avoid completion racing with
> > submission.
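
(Side note, just to confirm I'm reading this ordering requirement
correctly -- below is only a rough sketch of the shape I understand,
with made-up names (writepage_submit_sketch() and the exact fields are
illustrative); it is not the actual iomap/XFS code:)

/*
 * Sketch only: build all ioends/bios for the page first, mark the page
 * under writeback, and only then submit, so that bio completion can
 * never call end_page_writeback() on a page that isn't under writeback.
 */
static void writepage_submit_sketch(struct page *page,
		struct list_head *submit_list)
{
	struct iomap_ioend *ioend, *next;

	/* 1) map the page and build ioends/bios onto submit_list */

	/* 2) mark the page before any of those bios can possibly complete */
	set_page_writeback(page);
	unlock_page(page);

	/* 3) now submission cannot race with end_page_writeback() */
	list_for_each_entry_safe(ioend, next, submit_list, io_list) {
		list_del_init(&ioend->io_list);
		submit_bio(ioend->io_bio);
	}
}
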
> > 
> > Hence we can't actually avoid nesting ioend allocation here within a
> > single page - the constraints of page cache writeback require it.
> > Hence the construction of the iomap_ioend_bioset uses a pool size of:
> > 
> > 	4 * (PAGE_SIZE / SECTOR_SIZE)
> > 
> > So that we always have enough ioends cached in the mempool to
> > guarantee forwards progress of writeback of any single page under
> > writeback.
> 
> OK, it looks like this is just for sub-page IO, so there isn't such an
> issue in the case of multiple ioends.

Thinking about it further, this way is still fragile: suppose there are
8 contexts in which writeback is running at the same time, and each
needs to allocate 6 ioends; then none of the contexts may be able to
move on when allocating its 6th ioend.

> 
> > But that is a completely separate problem to this:
> > 
> > > Chained bios have been submitted already, but all can't be completed/freed
> > > until the whole ioend is done, so that submission won't make forward progress.
> > 
> > This is a problem caused by having unbound contiguous ioend sizes,
> > not a problem caused by chaining bios. We can throw 256 pages into
> > a bio, so when we are doing huge contiguous IOs, we can map a
> > lot of sequential dirty pages into a contiguous extent into a very
> > long bio chain. But if we cap the max ioend size to, say, 4096
> > pages, then we've effectively capped the number of bios that can be
> > nested by such a writeback chain.
> 
> If the 4096 pages are not contiguous, there may be 4096/256 = 16 bios
> allocated for a single ioend: one is allocated from iomap_ioend_bioset,
> and the other 15 are allocated by bio_alloc() from fs_bio_set, which
> only reserves 2 bios.
> 
> > I was about to point you at the patchset that fixes this, but you've
> > already found it and are claiming that this nesting is an unfixable
> > problem. Capping the size of the ioend also bounds the depth of the
> > allocation nesting that will occur, and that fixes the whole nesting
> > deadlock problem: if the mempool reserves are deeper than the
> > maximum chain nesting that can occur, then there is no deadlock.
> > 
> > However, this points out what the real problem here is: that bio
> > allocation is designed to deadlock when nesting bio allocation rather
> > than fail. Hence at the iomap level we've got no way of knowing that
> > we should terminate and submit the current bio chain and start a new
> > ioend+bio chain, because we will hang before there's any indication
> > that a deadlock could occur.
> 
> Most reservations are small, fs_bio_set included, so bio_alloc_bioset()
> documents that callers should 'never allocate more than 1 bio at a
> time'. Actually iomap_chain_bio() does allocate the new bio first and
> then submits the old one, and fs_bio_set reserves two bios, so that
> order (alloc before submit) is fine.

It may not be fine, though, when more than one context is involved in
writeback; a rough sketch of the worst case I have in mind is appended
after my signature.

Thanks,
Ming
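
A rough sketch of the chaining pattern and the arithmetic above (the
helper name and comments here are only for illustration -- the real
code is iomap_chain_bio(); I'm also assuming 256 pages per bio and the
4096-page ioend cap discussed above):

/*
 * Allocate the new bio from fs_bio_set *before* submitting the old one.
 * For a single submitter the two reserved bios are enough, but a fully
 * discontiguous 4096-page ioend still needs 4096 / 256 = 16 bios: one
 * from iomap_ioend_bioset plus 15 chained ones from fs_bio_set, so
 * several writeback contexts doing this at once can all end up blocked
 * in bio_alloc() waiting on the same small reserve.
 */
static struct bio *chain_bio_sketch(struct bio *prev)
{
	struct bio *new;

	/* may sleep on the fs_bio_set mempool if the reserve is exhausted */
	new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
	new->bi_iter.bi_sector = bio_end_sector(prev);
	new->bi_opf = prev->bi_opf;

	bio_chain(prev, new);	/* new won't complete before prev does */
	submit_bio(prev);	/* the old bio goes out only after the new alloc */
	return new;
}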