Re: [PATCH] mm/page-writeback: Raise wb_thresh to prevent write blocking with strictlimit

Jan Kara <jack@xxxxxxx> · Wed, 13 Nov 2024 11:07:35 +0100

On Tue 12-11-24 16:45:39, Jim Zhao wrote:
> > On Fri 08-11-24 11:19:49, Jim Zhao wrote:
> > > > On Wed 23-10-24 18:00:32, Jim Zhao wrote:
> > > > > With the strictlimit flag, wb_thresh acts as a hard limit in
> > > > > balance_dirty_pages() and wb_position_ratio(). When device write
> > > > > operations are inactive, wb_thresh can drop to 0, causing writes to
> > > > > be blocked. The issue occasionally occurs in fuse fs, particularly
> > > > > with network backends, where the write thread is frequently blocked
> > > > > for a period. To address it, this patch raises the minimum wb_thresh to a
> > > > > controllable level, similar to the non-strictlimit case.
> > > > >
> > > > > Signed-off-by: Jim Zhao <jimzhao.ai@xxxxxxxxx>
> > > >
> > > > ...
> > > >
> > > > > +       /*
> > > > > +        * With strictlimit flag, the wb_thresh is treated as
> > > > > +        * a hard limit in balance_dirty_pages() and wb_position_ratio().
> > > > > +        * It's possible that wb_thresh is close to zero, not because
> > > > > +        * the device is slow, but because it has been inactive.
> > > > > +        * To prevent occasional writes from being blocked, we raise wb_thresh.
> > > > > +        */
> > > > > +       if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > > > > +               unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> > > > > +               u64 wb_scale_thresh = 0;
> > > > > +
> > > > > +               if (limit > dtc->dirty)
> > > > > +                       wb_scale_thresh = (limit - dtc->dirty) / 100;
> > > > > +               wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4));
> > > > > +       }
> > > >
> > > > What you propose makes sense in principle although I'd say this is mostly a
> > > > userspace setup issue - with strictlimit enabled, you're kind of expected
> > > > to set min_ratio exactly if you want to avoid these startup issues. But I
> > > > tend to agree that we can provide a bit of a slack for a bdi without
> > > > min_ratio configured to ramp up.
> > > >
> > > > But I'd rather pick the logic like:
> > > >
> > > >   /*
> > > >    * If bdi does not have min_ratio configured and it was inactive,
> > > >    * bump its min_ratio to 0.1% to provide it some room to ramp up.
> > > >    */
> > > >   if (!wb_min_ratio && !numerator)
> > > >           wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2);
> > > >
> > > > That would seem like a bit more systematic way than the formula you propose
> > > > above...
> > >
> > > Thanks for the advice.
> > > Here's the explanation of the formula:
> > > 1. when writes are small and intermittent, wb_thresh can approach 0, not
> > > just 0, making the numerator value difficult to verify.
> >
> > I see, ok.
> >
> > > 2. The ramp-up margin, whether 0.1% or another value, needs
> > > consideration.
> > > I based this on the logic of wb_position_ratio in the non-strictlimit
> > > scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems
> > > to provide more room and ensures ramping up within a controllable range.
> >
> > I see, thanks for explanation. So I was thinking how to make the code more
> > consistent instead of adding another special constant and workaround. What
> > I'd suggest is:
> >
> > 1) There's already code that's supposed to handle ramping up with
> > strictlimit in wb_update_dirty_ratelimit():
> >
> >         /*
> >          * For strictlimit case, calculations above were based on wb counters
> >          * and limits (starting from pos_ratio = wb_position_ratio() and up to
> >          * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
> >          * Hence, to calculate "step" properly, we have to use wb_dirty as
> >          * "dirty" and wb_setpoint as "setpoint".
> >          *
> >          * We rampup dirty_ratelimit forcibly if wb_dirty is low because
> >          * it's possible that wb_thresh is close to zero due to inactivity
> >          * of backing device.
> >          */
> >         if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> >                 dirty = dtc->wb_dirty;
> >                 if (dtc->wb_dirty < 8)
> >                         setpoint = dtc->wb_dirty + 1;
> >                 else
> >                         setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
> >         }
> >
> > Now I agree that increasing wb_thresh directly is more understandable and
> > transparent so I'd just drop this special case.
> 
> yes, I agree.
> 
> > 2) I'd just handle all the bumping of wb_thresh in a single place instead
> > of having it spread over multiple places. So __wb_calc_thresh() could have
> > code like:
> >
> >         wb_thresh = (thresh * (100 * BDI_RATIO_SCALE - bdi_min_ratio)) / (100 * BDI_RATIO_SCALE);
> >         wb_thresh *= numerator;
> >         wb_thresh = div64_ul(wb_thresh, denominator);
> >
> >         wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
> >
> >         wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE);
> >         limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> >         /*
> >          * It's very possible that wb_thresh is close to 0 not because the
> >          * device is slow, but that it has remained inactive for long time.
> >          * Honour such devices a reasonable good (hopefully IO efficient)
> >          * threshold, so that the occasional writes won't be blocked and active
> >          * writes can rampup the threshold quickly.
> >          */
> >         if (limit > dtc->dirty)
> >                 wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
> >         if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE))
> >                 wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
> >
> > and we can drop the bumping from wb_position_ratio(). This way we have the
> > wb_thresh bumping in a single logical place. Since we still limit wb_thresh
> > with max_ratio, untrusted bdis for which max_ratio should be configured
> > (otherwise they can grow the amount of dirty pages up to the global
> > threshold anyway) are still under control.
> >
> > If we really wanted, we could introduce a different bumping in case of
> > strictlimit, but at this point I don't think it is warranted so I'd leave
> > that as an option if someone comes up with a situation where this bumping
> > proves to be too aggressive.
> 
> Thank you, this is very helpful. And I have 2 concerns:
> 
> 1.
> In the current non-strictlimit logic, wb_thresh is only bumped within
> wb_position_ratio() for calculating pos_ratio, and this bump isn’t
> restricted by max_ratio.  I’m unsure if moving this adjustment to
> __wb_calc_thresh() would affect existing behavior.  Would it be possible
> to keep the current logic for non-strictlimit case?

You are correct that the current bumping is not affected by max_ratio, and
that is actually a bug. wb_thresh should never exceed what corresponds to
the configured max_ratio. Furthermore, in practical configurations I don't
think the max_ratio limiting will actually make a big difference because
bumping should happen only when wb_thresh is really low. So for consistency
I would apply it also to the non-strictlimit case.
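
To illustrate the ordering (bump first, then clamp by max_ratio), here is a
minimal user-space sketch of the calculation quoted above. It is not the
kernel code: sketch_wb_thresh() and RATIO_SCALE are hypothetical names, the
inputs are plain page counts passed in by the caller, and ratios are assumed
to be expressed in units where 100 * RATIO_SCALE means 100%.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror BDI_RATIO_SCALE; 100 * RATIO_SCALE corresponds to 100%. */
#define RATIO_SCALE 10000ULL

/*
 * Hypothetical user-space model of the proposed __wb_calc_thresh() logic.
 * All arguments are supplied by the caller; nothing is read from a real bdi.
 */
static uint64_t sketch_wb_thresh(uint64_t thresh, uint64_t limit,
                                 uint64_t dirty,
                                 uint64_t numerator, uint64_t denominator,
                                 uint64_t bdi_min_ratio,
                                 uint64_t wb_min_ratio, uint64_t wb_max_ratio)
{
        const uint64_t full = 100 * RATIO_SCALE;
        uint64_t wb_thresh, cap;

        assert(denominator != 0);

        /* Share of thresh proportional to this wb's recent writeout. */
        wb_thresh = thresh * (full - bdi_min_ratio) / full;
        wb_thresh = wb_thresh * numerator / denominator;
        wb_thresh += thresh * wb_min_ratio / full;

        /* Bump an inactive wb to 1/8 of the remaining headroom. */
        if (limit > dirty) {
                uint64_t bump = (limit - dirty) / 8;

                if (wb_thresh < bump)
                        wb_thresh = bump;
        }

        /* Clamp last, so the bump can never exceed max_ratio. */
        cap = thresh * wb_max_ratio / full;
        if (wb_thresh > cap)
                wb_thresh = cap;

        return wb_thresh;
}
```

With thresh = limit = 100000 pages, 20000 dirty pages, and an inactive wb
(numerator 0), the bump alone would give 10000 pages; with max_ratio at 1% the
result in this sketch is clamped to 1000 pages.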

> 2. Regarding the formula:
> wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
> 
> Consider a case:
> With 100 fuse devices (with high max_ratio) experiencing high writeback
> delays, the pages being written back are accounted in NR_WRITEBACK_TEMP,
> not dtc->dirty.  As a result, the bumped wb_thresh may remain high. While
> individual devices are under control, the total could exceed
> expectations.

I agree but this is a potential problem with any kind of bumping based on
'limit - dtc->dirty'. It is just a matter of how many fuse devices you have
and how exactly you have max_ratio configured.

> Although lowering the max_ratio can avoid this issue, how about reducing
> the bumped wb_thresh?
> 
> The formula in my patch:
> wb_scale_thresh = (limit - dtc->dirty) / 100;
> The intention is to use the default fuse max_ratio (1%) as the multiplier.

So basically you propose to use the "/ 8" factor for the normal case and the
"/ 100" factor for the strictlimit case. My position is that I would not
complicate the logic unless somebody comes up with a real-world setup where
the simpler logic is causing real problems. But if you feel strongly about
this, I'm fine with that option.
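
For a rough feel of the two factors, the following user-space sketch computes
the worst-case sum of bumped thresholds over many devices. It is only an
illustration under strong assumptions: every device is idle enough to get the
full bump, none of its in-flight pages show up in 'dirty', and the helper
names (sketch_bumped_thresh(), sketch_worst_case_total()) are made up for
this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bumped threshold of one idle device, i.e. a
 * 1/divisor share of the remaining headroom, capped at the page count
 * that corresponds to the device's max_ratio.
 */
static uint64_t sketch_bumped_thresh(uint64_t limit, uint64_t dirty,
                                     uint64_t divisor, uint64_t cap)
{
        uint64_t bump;

        assert(divisor != 0);
        bump = limit > dirty ? (limit - dirty) / divisor : 0;
        return bump < cap ? bump : cap;
}

/* Worst case: all ndev devices receive the same bump at the same time. */
static uint64_t sketch_worst_case_total(unsigned int ndev, uint64_t limit,
                                        uint64_t dirty, uint64_t divisor,
                                        uint64_t cap)
{
        return (uint64_t)ndev *
               sketch_bumped_thresh(limit, dirty, divisor, cap);
}
```

With limit = 100000 pages, 20000 dirty pages, and 100 devices whose max_ratio
does not constrain them, "/ 8" sums to 1000000 pages (12.5 times the 80000
pages of headroom), while "/ 100" sums to exactly the headroom. With a cap of
1000 pages per device (1% of the limit), the totals in this sketch are 100000
and 80000 pages, so a configured max_ratio removes most of the difference.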

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR