Re: Is congestion broken?

Michal Hocko <mhocko@xxxxxxxxxx> · Tue, 24 Sep 2019 14:16:43 +0200

On Mon 23-09-19 13:51:21, Jens Axboe wrote:
> On 9/23/19 1:45 PM, Matthew Wilcox wrote:
> > On Mon, Sep 23, 2019 at 01:38:23PM -0600, Jens Axboe wrote:
> >> On 9/23/19 5:19 AM, Matthew Wilcox wrote:
> >>>
> >>> Ping Jens?
> >>>
> >>> On Wed, Sep 18, 2019 at 08:49:49PM -0700, Matthew Wilcox wrote:
> >>>> On Thu, Sep 19, 2019 at 10:33:10AM +0800, Lin Feng wrote:
> >>>>> On 9/18/19 20:33, Michal Hocko wrote:
> >>>>>> I absolutely agree here. From you changelog it is also not clear what is
> >>>>>> the underlying problem. Both congestion_wait and wait_iff_congested
> >>>>>> should wake up early if the congestion is handled. Is this not the case?
> >>>>>
> >>>>> For now I don't know why, codes seem should work as you said, maybe I need to
> >>>>> trace more of the internals.
> >>>>> But weird thing is that once I set the people-disliked-tunable iowait
> >>>>> drop down instantly, this is contradictory to the code design.
> >>>>
> >>>> Yes, this is quite strange.  If setting a smaller timeout makes a
> >>>> difference, that indicates we're not waking up soon enough.  I see
> >>>> two possibilities; one is that a wakeup is missing somewhere -- ie the
> >>>> conditions under which we call clear_wb_congested() are wrong.  Or we
> >>>> need to wake up sooner.
> >>>>
> >>>> Umm.  We have clear_wb_congested() called from exactly one spot --
> >>>> clear_bdi_congested().  That is only called from:
> >>>>
> >>>> drivers/block/pktcdvd.c
> >>>> fs/ceph/addr.c
> >>>> fs/fuse/control.c
> >>>> fs/fuse/dev.c
> >>>> fs/nfs/write.c
> >>>>
> >>>> Jens, is something supposed to be calling clear_bdi_congested() in the
> >>>> block layer?  blk_clear_congested() used to exist until October 29th
> >>>> last year.  Or is something else supposed to be waking up tasks that
> >>>> are sleeping on congestion?
> >>
> >> Congestion isn't there anymore. It was always broken as a concept imho,
> >> since it was inherently racy. We used the old batching mechanism in the
> >> legacy stack to signal it, and it only worked for some devices.
> > 
> > Umm.  OK.  Well, something that used to work is now broken.  So how
> 
> It didn't really...

Maybe it would have been better to tell users who are using interface
that rely on this.

> > should we fix it?  Take a look at shrink_node() in mm/vmscan.c.  If we've
> > submitted a lot of writes to a device, and overloaded it, we want to
> > sleep until it's able to take more writes:
> > 
> >                  /*
> >                   * Stall direct reclaim for IO completions if underlying BDIs
> >                   * and node is congested. Allow kswapd to continue until it
> >                   * starts encountering unqueued dirty pages or cycling through
> >                   * the LRU too quickly.
> >                   */
> >                  if (!sc->hibernation_mode && !current_is_kswapd() &&
> >                     current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> >                          wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> > 
> > With a standard block device, that now sleeps until the timeout (100ms)
> > expires, which is far too long for a modern SSD but is probably tuned
> > just right for some legacy piece of spinning rust (or indeed a modern
> > USB stick).  How would the block layer like to indicate to the mm layer
> > "I am too busy, please let the device work for a bit"?
> 
> Maybe base the sleep on the bdi write speed? We can't feasibly tell you
> if something is congested. It used to sort of work on things like sata
> drives, since we'd get congested when we hit the queue limit and that
> wasn't THAT far off with reality. Didn't work on SCSI with higher queue
> depths, and certainly doesn't work on NVMe where most devices have very
> deep queues.
> 
> Or we can have something that does "sleep until X requests/MB have been
> flushed", something that the vm would actively call. Combined with a
> timeout as well, probably.
> 
> For the vm case above, it's further complicated by it being global
> state. I think you'd be better off just making the delay smaller.  100ms
> is an eternity, and 10ms wakeups isn't going to cause any major issues
> in terms of CPU usage. If we're calling the above wait_iff_congested(),
> it better because we're otherwise SOL.

I do not mind using a shorter sleeps. A downside would be that very slow
storages could go OOM sooner because we do not wait long enough on one
side or reintroducing stalls during direct reclaim fixed by 0e093d99763eb.

I do agree that our poor's man congestion feedback mechanism is far from
being great. It would be really great to have something more resembling
a real feedback though. The global aspect of the thing is surely not
helping much but the reclaim is encountering pages from different bdis
and accounting each of them sounds like an overkill and too elaborate to
implement. Would it be possible to have something like "throttle on at
least one too busy bdi"?
-- 
Michal Hocko
SUSE Labs