Re: [PATCH 8/8] vm: Add a tuning knob for vm.max_writeback_mb

Chris Mason <chris.mason@xxxxxxxxxx> · Tue, 8 Sep 2009 14:35:26 -0400

On Tue, Sep 08, 2009 at 07:55:01PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 19:46 +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > > Right, so what can we do to make it useful? I think the intent is to
> > > > limit the number of pages in writeback and provide some progress
> > > > feedback to the vm.
> > > > 
> > > > Going by your experience we're failing there.
> > > 
> > > Well, congestion_wait is a stop sign but not a queue.  So, if you're
> > > being nice and honoring congestion but another process (say O_DIRECT
> > > random writes) doesn't, then you back off forever and none of your IO
> > > gets done.
> > > 
> > > To get around this, you can add code to make sure that you do
> > > _some_ io, but this isn't enough for your work to get done
> > > quickly, and you do end up waiting in get_request() so the async
> > > benefits of using the congestion test go away.
> > > 
> > > If we changed everyone to honor congestion, we end up with a poll model
> > > because a ton of congestion_wait() callers create a thundering herd.
> > > 
> > > So, we could add a queue, and then congestion_wait() would look a lot
> > > like get_request_wait().  I'd rather that everyone just used
> > > get_request_wait, and then have us fix any latency problems in the
> > > elevator.
> > 
> > Except you'd need to lift it to the BDI layer, because not all backing
> > devices are a block device.
> > 
> > Making it into a per-bdi queue sounds good to me though.
> > 
> > > For me, perfect would be one or more threads per-bdi doing the
> > > writeback, and never checking for congestion (like what Jens' code
> > > does).  The congestion_wait inside balance_dirty_pages() is really just
> > > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > > away anyway.  We should switch that to a saner system of waiting for
> > > progress on the bdi writeback + dirty thresholds.
> > 
> > Right, one of the things we could possibly do is tie into
> > __bdi_writeout_inc() and test levels there once every so often and then
> > flip a bit when we're low enough to stop writing.
> 
> I think I'm somewhat confused here though..
> 
> There's kernel threads doing writeout, and there's apps getting stuck in
> balance_dirty_pages().
> 
> If we want all writeout to be done by kernel threads (bdi/pd-flush like
> things) then we still need to manage the actual apps and delay them.
> 
> As things stand now, we kick pdflush into action when dirty levels are
> above the background level, and start writing out from the app task when
> we hit the full dirty level.
> 
> Moving all writeout to a kernel thread sounds good from writing linear
> stuff pov, but what do we make apps wait on then?

I suppose we could come up with the perfect queuing system where procs
got in line and came out as the bdi became less busy.  Short of that,
schedule_timeout(HZ/10) isn't really a great idea because HZ/10 (100ms)
might be much, much too long for fast devices.

congestion_wait() isn't a great idea because the block device might stay
congested long after we've crossed below the threshold.

If there were a flag on the bdi that got cleared as things improved, we
could wait on that.
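
A rough userspace sketch of such a flag, to make the idea concrete.  None
of the names below are kernel API: bdi_model, its fields, and the
high/low marks are all made up for illustration.  The flag is set when
pages under writeback pass a high mark and cleared once completions
bring the count back to a low mark; in a real implementation the clear
would also wake anyone sleeping on the flag, which the sketch only
counts.

```c
#include <stdbool.h>

/* Hypothetical model of per-bdi state; not the kernel's backing_dev_info. */
struct bdi_model {
	unsigned long nr_writeback;	/* pages currently under writeback */
	unsigned long high_mark;	/* set the busy flag above this */
	unsigned long low_mark;		/* clear the busy flag at or below this */
	bool busy;			/* the flag a throttled task would wait on */
	unsigned long wakeups;		/* how often waiters would have been woken */
};

/* Called when pages are queued for writeback. */
static void model_submit(struct bdi_model *b, unsigned long pages)
{
	b->nr_writeback += pages;
	if (b->nr_writeback > b->high_mark)
		b->busy = true;
}

/* Called when writeback of some pages completes. */
static void model_complete(struct bdi_model *b, unsigned long pages)
{
	b->nr_writeback = pages > b->nr_writeback ? 0 : b->nr_writeback - pages;

	/* Clear only at the low mark, so the flag does not flap. */
	if (b->busy && b->nr_writeback <= b->low_mark) {
		b->busy = false;
		b->wakeups++;	/* a real version would wake sleepers here */
	}
}

/* What a throttled task would test before (or instead of) sleeping. */
static bool model_must_wait(const struct bdi_model *b)
{
	return b->busy;
}
```

The gap between the two marks is a design choice: with a single
threshold, every completion near the limit would toggle the flag and
wake waiters for very little progress.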

Otherwise, schedule_timeout() with increasing timeout values per
iteration and a poll on the thresholds isn't too far from what we have
now.
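
The increasing-timeout variant could look roughly like the userspace
simulation below.  Everything in it is an assumption for illustration:
the struct and function names are hypothetical, the sleep is simulated
by draining dirty pages at a fixed rate, and the doubling policy and
100ms cap are just one possible choice.

```c
#include <stdbool.h>

/* Hypothetical backoff state; times are in milliseconds. */
struct backoff {
	unsigned int cur_ms;	/* length of the next sleep */
	unsigned int max_ms;	/* cap, e.g. 100 to match HZ/10 */
};

/* Return the next sleep length and double it for the following round. */
static unsigned int backoff_next(struct backoff *b)
{
	unsigned int t = b->cur_ms ? b->cur_ms : 1;	/* never sleep 0ms */
	unsigned int doubled = t * 2;

	b->cur_ms = doubled > b->max_ms ? b->max_ms : doubled;
	return t;
}

/* Stand-in for the counters a throttled writer would poll. */
struct dirty_state {
	unsigned long nr_dirty;		/* dirty pages outstanding */
	unsigned long thresh;		/* stop throttling at or below this */
	unsigned long drain_per_ms;	/* simulated writeback rate */
};

static bool over_threshold(const struct dirty_state *s)
{
	return s->nr_dirty > s->thresh;
}

/* Simulated sleep: writeback drains pages while the task waits. */
static void simulated_sleep(struct dirty_state *s, unsigned int ms)
{
	unsigned long drained = s->drain_per_ms * ms;

	s->nr_dirty = drained > s->nr_dirty ? 0 : s->nr_dirty - drained;
}

/*
 * Poll the threshold, sleeping a little longer each round.  Gives up
 * after limit_ms so a stalled device cannot hold the caller forever.
 * Returns the total simulated time waited.
 */
static unsigned int throttle(struct dirty_state *s, struct backoff *b,
			     unsigned int limit_ms)
{
	unsigned int waited = 0;

	while (over_threshold(s) && waited < limit_ms) {
		unsigned int t = backoff_next(b);

		simulated_sleep(s, t);
		waited += t;
	}
	return waited;
}
```

With 1000 dirty pages, a threshold of 400, and a device that drains 100
pages per millisecond, the loop sleeps 1 + 2 + 4 = 7ms in total rather
than a fixed 100ms, which is the fast-device case from above.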

-chris
