On Mon, 2022-01-10 at 12:45 -0500, Brian Foster wrote:
> On Mon, Jan 10, 2022 at 07:18:47PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> > > On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 71a36ae120ee..39214577bc46 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >  	}
> > > >  }
> > > >  
> > > > +/*
> > > > + * Ioend completion routine for merged bios. This can only be called from task
> > > > + * contexts as merged ioends can be of unbound length. Hence we have to break up
> > > > + * the page writeback completion into manageable chunks to avoid long scheduler
> > > > + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> > > > + * good batch processing throughput without creating adverse scheduler latency
> > > > + * conditions.
> > > > + */
> > > >  void
> > > >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> > > >  {
> > > >  	struct list_head tmp;
> > > > +	int segments;
> > > > +
> > > > +	might_sleep();
> > > >  
> > > >  	list_replace_init(&ioend->io_list, &tmp);
> > > > +	segments = ioend->io_segments;
> > > >  	iomap_finish_ioend(ioend, error);
> > > >  
> > > >  	while (!list_empty(&tmp)) {
> > > > +		if (segments > 32768) {
> > > > +			cond_resched();
> > > > +			segments = 0;
> > > > +		}
> > > 
> > > How is this intended to address the large bvec scenario? AFAICT bio_segments() doesn't account for multipage bvecs, so the above logic can allow something like 34b (?) 4k pages before a yield.
> > 
> > Right now the bvec segment iteration in iomap_finish_ioend() is completely unaware of multipage bvecs - as per above, bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec, regardless of whether the pages are stored in a multipage bvec or not. Hence it always iterates the entire bio a single page at a time.
> > 
> > IOWs, we don't use multi-page bvecs in iomap writeback, nor is it aware of them at all. We're adding single pages to bios via bio_add_page(), which may merge them internally into multipage bvecs. However, all our iterators use single page interfaces, hence we don't see the internal multi-page structure of the bio at all. As such, bio_segments() should return the number of PAGE_SIZE pages attached to the bio regardless of its internal structure.
> 
> That is pretty much the point. The completion loop doesn't really care whether the amount of page processing work is due to a large bio chain, multipage bvec(s), merged ioends, or some odd combination thereof. As you note, these conditions can manifest from various layers above or below iomap. I don't think iomap really needs to know or care about any of this. It just needs to yield when it has spent "too much" time processing pages.
> 
> With regard to the iterators, my understanding was that bio_for_each_segment_all() walks the multipage bvecs but bio_for_each_segment() does not, though that could certainly be wrong as I find the iterators a bit confusing. Either way, the most recent test with the ioend granular filter implies that a single ioend can still become a soft lockup vector from non-atomic context.
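(An aside while I try to follow the iterator discussion: the untested helpers below are how I currently understand the granularity of the two iterators after reading bvec.h. The function names are mine, nothing in the tree; only bio_for_each_segment() and bio_for_each_bvec() are real. As far as I can tell, bio_for_each_segment_all() has the same single-page granularity as bio_for_each_segment(), it just walks the bvec array directly instead of going through bi_iter, so it still works after the bio has been completed and bi_iter consumed.)

#include <linux/bio.h>

/*
 * Count PAGE_SIZE granular segments: bio_for_each_segment() caps each
 * bio_vec it hands back at a page boundary, so this should match what
 * bio_segments() reports even when bio_add_page() has merged physically
 * contiguous pages into a single multipage bvec.
 */
static unsigned int count_bio_page_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int segs = 0;

	bio_for_each_segment(bv, bio, iter)
		segs++;
	return segs;
}

/*
 * Count the bvecs as stored: bio_for_each_bvec() hands back the full
 * (possibly multipage) bio_vec. If I'm reading bi_vcnt right, for the
 * trace below this would return 1 while the helper above returns 4096.
 */
static unsigned int count_bio_bvecs(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int bvecs = 0;

	bio_for_each_bvec(bv, bio, iter)
		bvecs++;
	return bvecs;
}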
> > That is what I see on a trace from a large single file submission, comparing bio_segments() output with the page count on an ioend:
> > 
> >   kworker/u67:2-187 [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
> >   kworker/u67:2-187 [017] 13530.777157: iomap_writepages: 3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608
> > 
> > Which shows we are building ioends with a single bio holding a single bvec containing 4096 pages and 4096 bio segments. So, as expected, bio_segments() matches the page count and we submit 4096-page ioends with a single bio attached to each.
> > 
> > This is clearly a case where we are getting physically contiguous page cache page allocation during write() syscalls, and the result is a single contiguous bvec from bio_add_page() doing physical page merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a physically contiguous 4096-page multipage bvec being dispatched. The lower layers slice and dice these huge bios down to what the hardware can handle...
> 
> I think we're in violent agreement here. That is the crux of multipage bvecs and what I've been trying to point out [1]. Ming (who I believe implemented it) pointed this out back when the problem was first reported. This is also why I asked Trond to test out the older patch series, because that was intended to cover this case.
> 
> [1] https://lore.kernel.org/linux-xfs/20220104192321.GF31606@magnolia/T/#mc08ffe4b619c1b503b2c1342157bdaa9823167c1
> 
> > What I'm not yet reproducing is whatever vector Trond is seeing that is causing the multi-second hold-offs. I get page completion processed at a rate of about a million pages per second per CPU, but I'm bandwidth limited to about 400,000 pages per second due to mapping->i_pages lock contention (reclaim vs page cache instantiation vs writeback completion). I'm not seeing merged ioend batches larger than about 40,000 pages being processed at once. Hence I can't yet see where the millions of pages in a single ioend completion that would be required to hold a CPU for tens of seconds are coming from...
> 
> I was never able to reproduce the actual warning either (only construct the unexpectedly large page sequences through various means), so I'm equally curious about that aspect of the problem. My only guess at the moment is that perhaps hardware is enough of a factor to increase the cost (e.g. slow CPU, cacheline misses, etc.)? I dunno..

So I did try patches 1 and 2 from this series over the weekend, but for some reason the resulting kernel started hanging during I/O. I'm still trying to figure out why.

I think I'll try Dave's new patch to see if that behaves similarly (in case this is something new in the stable kernel series) and report back.
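When I re-run the test I'll probably also drop something like the below into fs/iomap/buffered-io.c so I can see how big the merged completions actually get on my setup, for comparison with Dave's numbers above. This is throwaway debug of my own, not from any posted patch; io_size is the byte length of the ioend, so io_size >> PAGE_SHIFT is only an approximation of the page count for page-aligned writeback.

/*
 * Throwaway debug, meant to sit in fs/iomap/buffered-io.c (which already
 * pulls in linux/iomap.h and friends). Dump the head ioend and every ioend
 * merged onto its io_list, so the per-completion sizes can be compared with
 * the bio_segments() numbers in the trace above. Call it at the top of
 * iomap_finish_ioends(), before the list_replace_init() moves the chain off
 * ioend->io_list.
 */
static void dump_merged_ioends(struct iomap_ioend *ioend)
{
	struct iomap_ioend *io;

	trace_printk("ioend: ino 0x%lx off %lld size %llu (~%llu pages)\n",
		     ioend->io_inode->i_ino, ioend->io_offset,
		     (unsigned long long)ioend->io_size,
		     (unsigned long long)(ioend->io_size >> PAGE_SHIFT));

	list_for_each_entry(io, &ioend->io_list, io_list)
		trace_printk("  merged: off %lld size %llu (~%llu pages)\n",
			     io->io_offset,
			     (unsigned long long)io->io_size,
			     (unsigned long long)(io->io_size >> PAGE_SHIFT));
}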
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx