On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote: > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote: > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote: > > > Just wondering why the ioend isn't submitted out after it becomes full? > > > > block layer plugging? Although failing bio allocations will kick that, > > so that is not a deadlock risk. > > These ioends are just added to one list stored on local stack variable(submit_list), > how can block layer plugging observe & submit them out? We ignore that, as the commit histoy says: commit e10de3723c53378e7cf441529f563c316fdc0dd3 Author: Dave Chinner <dchinner@xxxxxxxxxx> Date: Mon Feb 15 17:23:12 2016 +1100 xfs: don't chain ioends during writepage submission Currently we can build a long ioend chain during ->writepages that gets attached to the writepage context. IO submission only then occurs when we finish all the writepage processing. This means we can have many ioends allocated and pending, and this violates the mempool guarantees that we need to give about forwards progress. i.e. we really should only have one ioend being built at a time, otherwise we may drain the mempool trying to allocate a new ioend and that blocks submission, completion and freeing of ioends that are already in progress. To prevent this situation from happening, we need to submit ioends for IO as soon as they are ready for dispatch rather than queuing them for later submission. This means the ioends have bios built immediately and they get queued on any plug that is current active. Hence if we schedule away from writeback, the ioends that have been built will make forwards progress due to the plug flushing on context switch. This will also prevent context switches from creating unnecessary IO submission latency. We can't completely avoid having nested IO allocation - when we have a block size smaller than a page size, we still need to hold the ioend submission until after we have marked the current page dirty. Hence we may need multiple ioends to be held while the current page is completely mapped and made ready for IO dispatch. We cannot avoid this problem - the current code already has this ioend chaining within a page so we can mostly ignore that it occurs. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Reviewed-by: Christoph Hellwig <hch@xxxxxx> Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx> IOWs, this nesting for block size < page size has been out there for many years now and we've yet to have anyone report that writeback deadlocks have occurred. There's a mistake in that commit message - we can't submit the ioends on a page until we've marked the page as under writeback, not dirty. That's because we cannot have ioends completing on a a page that isn't under writeback because calling end_page_writeback() on it when it isn't under writeback will BUG(). Hence we have to build all the submission state before we mark the page as under writeback and perform the submission(s) to avoid completion racing with submission. Hence we can't actually avoid nesting ioend allocation here within a single page - the constraints of page cache writeback require it. Hence the construction of the iomap_ioend_bioset uses a pool size of: 4 * (PAGE_SIZE / SECTOR_SIZE) So that we always have enough ioends cached in the mempool to guarantee forwards progress of writeback of any single page under writeback. But that is a completely separate problem to this: > Chained bios have been submitted already, but all can't be completed/freed > until the whole ioend is done, that submission won't make forward progress. This is a problem caused by having unbound contiguous ioend sizes, not a problem caused by chaining bios. We can throw 256 pages into a bio, so when we are doing huge contiguous IOs, we can map a lot of sequential dirty pages into a contiguous extent into a very long bio chain. But if we cap the max ioend size to, say, 4096 pages, then we've effectively capped the number of bios that can be nested by such a writeback chain. I was about to point you at the patchset that fixes this, but you've already found it and are claiming that this nesting is an unfixable problem. Capping the size of the ioend also bounds the depth of the allocation nesting that will occur, and that fixes the whole nseting deadlock problem: If the mempool reserves are deeper than than the maximum chain nesting that can occur, then there is no deadlock. However, this points out what the real problem here is: that bio allocation is designed to deadlock when nesting bio allocation rather than fail. Hence at the iomap level we've go no way of knowing that we should terminate and submit the current bio chain and start a new ioend+bio chain because we will hang before there's any indication that a deadlock could occur. And then the elephant in the room: reality. We've been nesting bio allocations via this chaining in production systems under heavy memory pressure for 5 years now and we don't have a single bug report indicating that this code deadlocks. So while there's a theoretical problem, evidence points to it not being an issue in practice. Hence I don't think there is any need to urgently turn this code upside down. I'd much prefer that bio allocation would fail rather than deadlock, because then we can set a flag in the writepage context that says "do single bio ioend submission only" for the duration of that writeback context. And with that, the forwards progress problem is solved whilst still allowing us to chain as deeply as we want when there is no memory pressure.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx