On Mon, 2022-01-10 at 12:45 -0500, Brian Foster wrote:
> On Mon, Jan 10, 2022 at 07:18:47PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> > > On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 71a36ae120ee..39214577bc46 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >  	}
> > > >  }
> > > >  
> > > > +/*
> > > > + * Ioend completion routine for merged bios. This can only be called from task
> > > > + * contexts as merged ioends can be of unbound length. Hence we have to break up
> > > > + * the page writeback completion into manageable chunks to avoid long scheduler
> > > > + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> > > > + * good batch processing throughput without creating adverse scheduler latency
> > > > + * conditions.
> > > > + */
> > > >  void
> > > >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> > > >  {
> > > >  	struct list_head tmp;
> > > > +	int segments;
> > > > +
> > > > +	might_sleep();
> > > >  
> > > >  	list_replace_init(&ioend->io_list, &tmp);
> > > > +	segments = ioend->io_segments;
> > > >  	iomap_finish_ioend(ioend, error);
> > > >  
> > > >  	while (!list_empty(&tmp)) {
> > > > +		if (segments > 32768) {
> > > > +			cond_resched();
> > > > +			segments = 0;
> > > > +		}
> > > 
> > > How is this intended to address the large bvec scenario? AFAICT bio_segments() doesn't account for multipage bvecs, so the above logic can allow something like 34b (?) 4k pages before a yield.
> > 
> > Right now the bvec segment iteration in iomap_finish_ioend() is completely unaware of multipage bvecs - as per above, bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec, regardless of whether the pages are stored in a multipage bvec or not. Hence it always iterates the entire bio a single page at a time.
> > 
> > IOWs, we don't use multi-page bvecs in iomap writeback, nor is it aware of them at all. We're adding single pages to bios via bio_add_page(), which may merge them internally into multipage bvecs. However, all our iterators use single page interfaces, hence we don't see the internal multi-page structure of the bio at all. As such, bio_segments() should return the number of PAGE_SIZE pages attached to the bio regardless of its internal structure.
> 
> That is pretty much the point. The completion loop doesn't really care whether the amount of page processing work is due to a large bio chain, multipage bvec(s), merged ioends, or some odd combination thereof. As you note, these conditions can manifest from various layers above or below iomap. I don't think iomap really needs to know or care about any of this. It just needs to yield when it has spent "too much" time processing pages.
> 
> With regard to the iterators, my understanding was that bio_for_each_segment_all() walks the multipage bvecs but bio_for_each_segment() does not, though that could certainly be wrong as I find the iterators a bit confusing. Either way, the most recent test with the ioend granular filter implies that a single ioend can still become a soft lockup vector from non-atomic context.
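(An aside while I try to follow the iterator discussion: the untested helpers below are how I currently understand the granularity of the two iterators after reading bvec.h. The function names are mine, nothing in the tree; only bio_for_each_segment() and bio_for_each_bvec() are real. As far as I can tell, bio_for_each_segment_all() has the same single-page granularity as bio_for_each_segment(), it just walks the bvec array directly instead of going through bi_iter, so it still works after the bio has been completed and bi_iter consumed.)

#include <linux/bio.h>

/*
 * Count PAGE_SIZE granular segments: bio_for_each_segment() caps each
 * bio_vec it hands back at a page boundary, so this should match what
 * bio_segments() reports even when bio_add_page() has merged physically
 * contiguous pages into a single multipage bvec.
 */
static unsigned int count_bio_page_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int segs = 0;

	bio_for_each_segment(bv, bio, iter)
		segs++;
	return segs;
}

/*
 * Count the bvecs as stored: bio_for_each_bvec() hands back the full
 * (possibly multipage) bio_vec. If I'm reading bi_vcnt right, for the
 * trace below this would return 1 while the helper above returns 4096.
 */
static unsigned int count_bio_bvecs(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int bvecs = 0;

	bio_for_each_bvec(bv, bio, iter)
		bvecs++;
	return bvecs;
}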
> > That is what I see on a trace from a large single file submission, comparing bio_segments() output with the page count on an ioend:
> > 
> >   kworker/u67:2-187 [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.777157: iomap_writepages: 3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608
> > 
> > Which shows we are building ioends with a single bio holding a single bvec containing 4096 pages and 4096 bio segments. So, as expected, bio_segments() matches the page count and we submit 4096-page ioends with a single bio attached to each.
> > 
> > This is clearly a case where we are getting physically contiguous page cache page allocation during write() syscalls, and the result is a single contiguous bvec from bio_add_page() doing physical page merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a physically contiguous 4096-page multipage bvec being dispatched. The lower layers slice and dice these huge bios down to what the hardware can handle...
> 
> I think we're in violent agreement here. That is the crux of multipage bvecs and what I've been trying to point out [1]. Ming (who I believe implemented it) pointed this out back when the problem was first reported. This is also why I asked Trond to test out the older patch series, because that was intended to cover this case.
> 
> [1] https://lore.kernel.org/linux-xfs/20220104192321.GF31606@magnolia/T/#mc08ffe4b619c1b503b2c1342157bdaa9823167c1
> 
> > What I'm not yet reproducing is whatever vector Trond is seeing that is causing the multi-second hold-offs. I get page completion processed at a rate of about a million pages per second per CPU, but I'm bandwidth limited to about 400,000 pages per second due to mapping->i_pages lock contention (reclaim vs page cache instantiation vs writeback completion). I'm not seeing merged ioend batches larger than about 40,000 pages being processed at once. Hence I can't yet see where the millions of pages in a single ioend completion that would be required to hold a CPU for tens of seconds are coming from...
> 
> I was never able to reproduce the actual warning either (only construct the unexpectedly large page sequences through various means), so I'm equally curious about that aspect of the problem. My only guess at the moment is that perhaps hardware is enough of a factor to increase the cost (e.g. slow CPU, cacheline misses, etc.)? I dunno..

So I did try patches 1 and 2 from this series over the weekend, but for some reason the resulting kernel started hanging during I/O. I'm still trying to figure out why.

I think I'll try Dave's new patch to see if that behaves similarly (in case this is something new in the stable kernel series) and report back.
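When I re-run the test I'll probably also drop something like the below into fs/iomap/buffered-io.c so I can see how big the merged completions actually get on my setup, for comparison with Dave's numbers above. This is throwaway debug of my own, not from any posted patch; io_size is the byte length of the ioend, so io_size >> PAGE_SHIFT is only an approximation of the page count for page-aligned writeback.

/*
 * Throwaway debug, meant to sit in fs/iomap/buffered-io.c (which already
 * pulls in linux/iomap.h and friends). Dump the head ioend and every ioend
 * merged onto its io_list, so the per-completion sizes can be compared with
 * the bio_segments() numbers in the trace above. Call it at the top of
 * iomap_finish_ioends(), before the list_replace_init() moves the chain off
 * ioend->io_list.
 */
static void dump_merged_ioends(struct iomap_ioend *ioend)
{
	struct iomap_ioend *io;

	trace_printk("ioend: ino 0x%lx off %lld size %llu (~%llu pages)\n",
		     ioend->io_inode->i_ino, ioend->io_offset,
		     (unsigned long long)ioend->io_size,
		     (unsigned long long)(ioend->io_size >> PAGE_SHIFT));

	list_for_each_entry(io, &ioend->io_list, io_list)
		trace_printk("  merged: off %lld size %llu (~%llu pages)\n",
			     io->io_offset,
			     (unsigned long long)io->io_size,
			     (unsigned long long)(io->io_size >> PAGE_SHIFT));
}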
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx