On 2021/05/25 13:20, Darrick J. Wong wrote:
> On Mon, May 24, 2021 at 08:02:18AM -0400, Brian Foster wrote:
>> On Thu, May 20, 2021 at 04:27:37PM -0700, Darrick J. Wong wrote:
>>> On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
>>>> The iomap writeback infrastructure is currently able to construct
>>>> extremely large bio chains (tens of GBs) associated with a single
>>>> ioend. This consolidation provides no significant value as bio
>>>> chains increase beyond a reasonable minimum size. On the other hand,
>>>> this does hold significant numbers of pages in the writeback
>>>> state across an unnecessarily large number of bios because the ioend
>>>> is not processed for completion until the final bio in the chain
>>>> completes. Cap an individual ioend to a reasonable size of 4096
>>>> pages (16MB with 4k pages) to avoid this condition.
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
>>>> ---
>>>>  fs/iomap/buffered-io.c |  6 ++++--
>>>>  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
>>>>  2 files changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>> index 642422775e4e..f2890ee434d0 100644
>>>> --- a/fs/iomap/buffered-io.c
>>>> +++ b/fs/iomap/buffered-io.c
>>>> @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
>>>>  
>>>>  static bool
>>>>  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>>>> -		sector_t sector)
>>>> +		unsigned len, sector_t sector)
>>>>  {
>>>>  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
>>>>  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
>>>> @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>>>>  		return false;
>>>>  	if (sector != bio_end_sector(wpc->ioend->io_bio))
>>>>  		return false;
>>>> +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
>>>> +		return false;
>>>>  	return true;
>>>>  }
>>>>  
>>>> @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
>>>>  	unsigned poff = offset & (PAGE_SIZE - 1);
>>>>  	bool merged, same_page = false;
>>>>  
>>>> -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
>>>> +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
>>>>  		if (wpc->ioend)
>>>>  			list_add(&wpc->ioend->io_list, iolist);
>>>>  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
>>>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>>>> index 07f3f4e69084..89b15cc236d5 100644
>>>> --- a/include/linux/iomap.h
>>>> +++ b/include/linux/iomap.h
>>>> @@ -203,6 +203,32 @@ struct iomap_ioend {
>>>>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
>>>>  };
>>>>  
>>>> +/*
>>>> + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
>>>> + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
>>>> + * effectively unbound in length. Hence the only limits on the size of the bio
>>>> + * chain is the contiguity of the extent on disk and the length of the run of
>>>> + * sequential dirty pages in the page cache. This can be tens of GBs of physical
>>>> + * extents and if memory is large enough, tens of millions of dirty pages.
>>>> + * Locking them all under writeback until the final bio in the chain is
>>>> + * submitted and completed locks all those pages for the legnth of time it takes
>>>
>>> s/legnth/length/
>>>
>>
>> Fixed.
>>
>>>> + * to write those many, many GBs of data to storage.
>>>> + *
>>>> + * Background writeback caps any single writepages call to half the device
>>>> + * bandwidth to ensure fairness and prevent any one dirty inode causing
>>>> + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
>>>> + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
>>>> + * come from. We want large IOs to reach the storage, but we need to limit
>>>> + * completion latencies, hence we need to control the maximum IO size we
>>>> + * dispatch to the storage stack.
>>>> + *
>>>> + * We don't really have to care about the extra IO completion overhead here
>>>> + * because iomap has contiguous IO completion merging. If the device can sustain
>>>
>>> Assuming you're referring to iomap_finish_ioends, only XFS employs the
>>> ioend completion merging, and only for ioends where it decides to
>>> override the default bi_end_io. iomap on its own never calls
>>> iomap_ioend_try_merge.
>>>
>>> This patch establishes a maximum ioend size of 4096 pages so that we
>>> don't trip the lockup watchdog while clearing pagewriteback and also so
>>> that we don't pin a large number of pages while constructing a big chain
>>> of bios. On gfs2 and zonefs, each ioend completion will now have to
>>> clear up to 4096 pages from whatever context bio_endio is called.
>>>
>>> For XFS it's a bit more complicated -- XFS already overrode the bio
>>> handler for ioends that required further metadata updates (e.g. unwritten
>>> conversion, eof extension, or cow) so that it could combine ioends when
>>> possible. XFS wants to combine ioends to amortize the cost of getting
>>> the ILOCK and running transactions over a larger number of pages.
>>>
>>> So I guess I see how the two changes dovetail nicely for XFS -- iomap
>>> issues smaller write bios, and the xfs ioend worker can recombine
>>> however many bios complete before the worker runs. As a bonus, we don't
>>> have to worry about situations like the device driver completing so many
>>> bios from a single invocation of a bottom half handler that we run afoul
>>> of the soft lockup timer.
>>>
>>> Is that a correct understanding of how the two changes intersect with
>>> each other? TBH I was expecting the two thresholds to be closer in
>>> value.
>>>
>>
>> I think so. That's interesting because my inclination was to make them
>> farther apart (or more specifically, increase the threshold in this
>> patch and leave the previous). The primary goal of this series was to
>> address the soft lockup warning problem, hence the thresholds on earlier
>> versions started at rather conservative values. I think both values have
>> been reasonably justified in being reduced, though this patch has a
>> broader impact than the previous in that it changes behavior for all
>> iomap-based filesystems. Of course that's something that could also be
>> addressed with a more dynamic tunable..
>
> <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> If people demonstrate a need to smart-tune or manual-tune we can always
> add one later.
>
> Though I guess I did kind of wonder if maybe a better limit for iomap
> would be max_hw_sectors? Since that's the maximum size of an IO that
> the kernel will issue for that device?
>
> (Hm, maybe not; my computers all have it set to 1280k, which is a
> pathetic 20 pages on a 64k-page system.)

Are you sure you are not looking at max_sectors (not max_hw_sectors)?
For an average SATA HDD, you generally get:

# cat max_hw_sectors_kb
32764
# cat max_sectors_kb
1280

So the hardware limit is a 32MB maximum command size, and that one cannot
be changed. The block IO layer uses the 1280KB soft limit defined by
max_sectors (max_sectors_kb in sysfs), but the user can tune that anywhere
from one sector up to max_hw_sectors_kb.

>
>>> The other two users of iomap for buffered io (gfs2 and zonefs) don't
>>> have a means to defer and combine ioends like xfs does. Do you think
>>> they should? I think it's still possible to trip the softlockup there.
>>>
>>
>> I'm not sure. We'd probably want some feedback from developers of
>> filesystems other than XFS before incorporating a change like this. The
>> first patch in the series mostly just provides some infrastructure for
>> other filesystems to avoid the problem as they see fit.
>
> Hmm. Any input from the two other users of iomap buffered IO? Who are
> now directly in the to: list? :D
>
> Catch-up TLDR: we're evaluating a proposal to limit the length of an
> iomap writeback ioend to 4096 pages so that we don't trip the hangcheck
> warning while clearing pagewriteback if the ioend completion happens to
> run in softirq context (e.g. nvme completion).
>
> --D
>
>> Brian
>>
>>> --D
>>>
>>>> + * high throughput and large bios, the ioends are merged on completion and
>>>> + * processed in large, efficient chunks with no additional IO latency.
>>>> + */
>>>> +#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
>>>> +
>>>>  struct iomap_writeback_ops {
>>>>  	/*
>>>>  	 * Required, maps the blocks so that writeback can be performed on
>>>> --
>>>> 2.26.3
>>>>
>>>
>>
>


-- 
Damien Le Moal
Western Digital Research
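For reference, a rough sketch of the completion-side merging Darrick describes
above: a filesystem using iomap writeback can defer its ioend completions to a
worker and re-merge physically contiguous ioends there, so the 4096-page cap
bounds completion latency without multiplying the per-extent transaction work.
The example_* structure and helpers below are hypothetical stand-ins for the
XFS pattern (xfs_end_io() and friends), not actual kernel code; only
iomap_sort_ioends(), iomap_ioend_try_merge() and iomap_finish_ioends() are
real iomap helpers, shown with the signatures they have as of this thread.

#include <linux/iomap.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Hypothetical per-inode completion state, mirroring the XFS pattern. */
struct example_inode {
	spinlock_t		i_ioend_lock;
	struct list_head	i_ioend_list;	/* completed, unprocessed ioends */
	struct work_struct	i_ioend_work;
};

/* Placeholder for the expensive per-extent work (unwritten conversion etc.). */
static void example_end_ioend(struct iomap_ioend *ioend)
{
	iomap_finish_ioends(ioend, 0);
}

/*
 * Deferred completion worker: splice off everything that has completed so
 * far, sort it, and re-merge contiguous neighbours before doing the
 * per-extent work once for the combined range.
 */
static void example_end_io_worker(struct work_struct *work)
{
	struct example_inode *ip =
		container_of(work, struct example_inode, i_ioend_work);
	struct iomap_ioend *ioend;
	LIST_HEAD(tmp);

	spin_lock_irq(&ip->i_ioend_lock);
	list_replace_init(&ip->i_ioend_list, &tmp);
	spin_unlock_irq(&ip->i_ioend_lock);

	iomap_sort_ioends(&tmp);
	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
						 io_list))) {
		list_del_init(&ioend->io_list);
		/* Fold physically contiguous ioends back into this one. */
		iomap_ioend_try_merge(ioend, &tmp, NULL);
		example_end_ioend(ioend);
	}
}

Without such a worker (the current gfs2/zonefs situation per the discussion
above), each capped ioend is finished directly from whatever context
bio_endio() is called in.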