On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 71a36ae120ee..39214577bc46 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> >  	}
> >  }
> >  
> > +/*
> > + * Ioend completion routine for merged bios. This can only be called from task
> > + * contexts as merged ioends can be of unbound length. Hence we have to break up
> > + * the page writeback completion into manageable chunks to avoid long scheduler
> > + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> > + * good batch processing throughput without creating adverse scheduler latency
> > + * conditions.
> > + */
> >  void
> >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> >  {
> >  	struct list_head tmp;
> > +	int segments;
> > +
> > +	might_sleep();
> >  
> >  	list_replace_init(&ioend->io_list, &tmp);
> > +	segments = ioend->io_segments;
> >  	iomap_finish_ioend(ioend, error);
> >  
> >  	while (!list_empty(&tmp)) {
> > +		if (segments > 32768) {
> > +			cond_resched();
> > +			segments = 0;
> > +		}
> 
> How is this intended to address the large bi_vec scenario? AFAICT
> bio_segments() doesn't account for multipage bvecs so the above logic
> can allow something like 34b (?) 4k pages before a yield.

Right now the bvec segment iteration in iomap_finish_ioend() is
completely unaware of multipage bvecs - as per above,
bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec,
regardless of whether the pages are stored in a multipage bvec or
not. Hence it always iterates the entire bio a single page at a time.

IOWs, we don't use multi-page bvecs in iomap writeback, nor is it
aware of them at all. We're adding single pages to bios via
bio_add_page(), which may merge them internally into multipage bvecs.
However, all our iterators use single page interfaces, hence we don't
see the internal multi-page structure of the bio at all. As such,
bio_segments() should return the number of PAGE_SIZE pages attached
to the bio regardless of its internal structure.

That is what I see in a trace of a large single file submission,
comparing bio_segments() output with the page count on an ioend:

 kworker/u67:2-187 [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
 kworker/u67:2-187 [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
 kworker/u67:2-187 [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
 kworker/u67:2-187 [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
 kworker/u67:2-187 [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
 kworker/u67:2-187 [017] 13530.777157: iomap_writepages:   3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608

This shows we are building ioends containing a single bio with a
single bvec, holding 4096 pages and 4096 bio segments. So, as
expected, bio_segments() matches the page count and we submit 4096
page ioends with a single bio attached to each.
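For reference, here's roughly what bio_segments() boils down to - a
paraphrased sketch of the helper in include/linux/bio.h, with the
discard/write-zeroes special casing elided, so treat it as
illustrative rather than the exact source:

/*
 * Count the single-page segments in a bio. bio_for_each_segment()
 * advances the bvec_iter in at-most-PAGE_SIZE steps - unlike
 * bio_for_each_bvec(), which walks whole multipage bvecs - so a
 * single 16MB multipage bvec still counts as 4096 segments here.
 */
static inline unsigned bio_segments(struct bio *bio)
{
	unsigned segs = 0;
	struct bio_vec bv;
	struct bvec_iter iter;

	bio_for_each_segment(bv, bio, iter)
		segs++;

	return segs;
}

That single-page stepping is why the segment count in the patch above
should track the per-page end_page_writeback() completion work
one-to-one.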
This is clearly a case where we are getting physically contiguous
page cache page allocation during write() syscalls, and the result is
a single contiguous bvec from bio_add_page() doing physical page
merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a
physically contiguous 4096-page multipage bvec being dispatched. The
lower layers slice and dice these huge bios down to what the hardware
can handle...

What I'm not yet reproducing is whatever vector Trond is seeing that
causes the multi-second holdoffs. I get page completions processed at
a rate of about a million pages per second per CPU, but I'm bandwidth
limited to about 400,000 pages per second due to mapping->i_pages
lock contention (reclaim vs page cache instantiation vs writeback
completion). I'm not seeing merged ioend batches of larger than about
40,000 pages being processed at once. Hence I can't yet see where the
millions of pages in a single ioend completion that would be required
to hold a CPU for tens of seconds are coming from...

> That aside, I find the approach odd in that we calculate the segment
> count for each bio via additional iteration (which is how bio_segments()
> works) and track the summation of the chain in the ioend only to provide
> iomap_finish_ioends() with a subtly inaccurate view of how much work
> iomap_finish_ioend() is doing as the loop iterates.

I just did that so I didn't have to count pages as the bio is built.
Easy to change - in fact I have changed it to check that
bio_segments() was returning the page count I expected it to
return...

I also changed the completion side to just count end_page_writeback()
calls, and I get the same number of cond_resched() calls being made
as with the bio_segments() count. So AFAICT there's no change of
behaviour or accounting between the two methods, and I'm not sure
where the latest problem Trond reported is coming from...

> We already have this
> information in completion context and iomap_finish_ioends() is just a
> small iterator function, so I don't understand why we wouldn't do
> something like factor these two loops into a non-atomic context only
> variant that yields based on the actual amount of page processing work
> being done (i.e. including multipage bvecs). That seems more robust and
> simple to me, but that's just my .02.

iomap_finish_ioends() is pretty much that non-atomic version of the
ioend completion code. Merged ioend chains cannot be sanely handled
in atomic context and so it has to be called from task context. Hence
the might_sleep() I added to ensure that we get warnings if it is
called from atomic context.

As for limiting atomic context completion processing, we've
historically done that by limiting the size of individual IO chains
submitted during writeback. This means that atomic completion
contexts don't need any special signalling (i.e. conditional
"in_atomic()" behaviour) because they aren't given anything to
process that would cause problems in atomic contexts...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx