On Tue, Sep 08, 2009 at 07:55:01PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 19:46 +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > > Right, so what can we do to make it useful? I think the intent is to
> > > > limit the number of pages in writeback and provide some progress
> > > > feedback to the vm.
> > > >
> > > > Going by your experience we're failing there.
> > >
> > > Well, congestion_wait is a stop sign but not a queue. So, if you're
> > > being nice and honoring congestion but another process (say O_DIRECT
> > > random writes) doesn't, then you back off forever and none of your IO
> > > gets done.
> > >
> > > To get around this, you can add code to make sure that you do
> > > _some_ io, but this isn't enough for your work to get done
> > > quickly, and you do end up waiting in get_request() so the async
> > > benefits of using the congestion test go away.
> > >
> > > If we changed everyone to honor congestion, we end up with a poll model
> > > because a ton of congestion_wait() callers create a thundering herd.
> > >
> > > So, we could add a queue, and then congestion_wait() would look a lot
> > > like get_request_wait(). I'd rather that everyone just used
> > > get_request_wait, and then have us fix any latency problems in the
> > > elevator.
> >
> > Except you'd need to lift it to the BDI layer, because not all backing
> > devices are a block device.
> >
> > Making it into a per-bdi queue sounds good to me though.
> >
> > > For me, perfect would be one or more threads per-bdi doing the
> > > writeback, and never checking for congestion (like what Jens' code
> > > does). The congestion_wait inside balance_dirty_pages() is really just
> > > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > > away anyway. We should switch that to a saner system of waiting for
> > > progress on the bdi writeback + dirty thresholds.
> >
> > Right, one of the things we could possibly do is tie into
> > __bdi_writeout_inc() and test levels there once every so often and then
> > flip a bit when we're low enough to stop writing.
>
> I think I'm somewhat confused here though..
>
> There's kernel threads doing writeout, and there's apps getting stuck in
> balance_dirty_pages().
>
> If we want all writeout to be done by kernel threads (bdi/pd-flush like
> things) then we still need to manage the actual apps and delay them.
>
> As things stand now, we kick pdflush into action when dirty levels are
> above the background level, and start writing out from the app task when
> we hit the full dirty level.
>
> Moving all writeout to a kernel thread sounds good from writing linear
> stuff pov, but what do we make apps wait on then?

I suppose we could come up with the perfect queuing system where procs
got in line and came out as the bdi became less busy.

The problem is that schedule_timeout(HZ/10) isn't really a great idea
because HZ/10 might be much much too long for fast devices.
congestion_wait() isn't a great idea because the block device might stay
congested long after we've crossed below the threshold.

If there was a flag on the bdi that got cleared as things improved, we
could wait on that. Otherwise, schedule_timeout() with increasing
timeout values per iteration and a poll on the thresholds isn't too far
from what we have now.

-chris
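
For illustration only, here is a rough user-space sketch of the last
option: sleep with a growing timeout and poll the threshold after each
sleep. It is not kernel code. struct sim_bdi, sim_over_thresh(),
sim_sleep() and wait_below_thresh() are all hypothetical names made up
for this example, and the "device" is just a counter that drains a fixed
number of pages per millisecond.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-bdi writeback state (not a kernel struct). */
struct sim_bdi {
	unsigned long dirty;        /* pages still dirty */
	unsigned long thresh;       /* dirty level we must get down to */
	unsigned long drain_per_ms; /* pages the simulated device cleans per ms */
};

/* Poll: are we still above the dirty threshold? */
static bool sim_over_thresh(const struct sim_bdi *bdi)
{
	return bdi->dirty > bdi->thresh;
}

/* Stand-in for schedule_timeout(): let the simulated device run for ms. */
static void sim_sleep(struct sim_bdi *bdi, unsigned long ms)
{
	unsigned long cleaned = bdi->drain_per_ms * ms;

	bdi->dirty = cleaned >= bdi->dirty ? 0 : bdi->dirty - cleaned;
}

/*
 * Sleep with a timeout that doubles each iteration (capped at max_ms),
 * checking the threshold after every sleep. Returns total ms waited.
 * The simulated device must make progress, or this would never return.
 */
static unsigned long wait_below_thresh(struct sim_bdi *bdi,
				       unsigned long min_ms,
				       unsigned long max_ms)
{
	unsigned long timeout = min_ms;
	unsigned long waited = 0;

	assert(min_ms > 0 && min_ms <= max_ms);
	assert(bdi->drain_per_ms > 0);

	while (sim_over_thresh(bdi)) {
		sim_sleep(bdi, timeout);
		waited += timeout;
		timeout *= 2;
		if (timeout > max_ms)
			timeout = max_ms;
	}
	return waited;
}
```

Starting at 1 ms, a device that drains quickly gets its waiter back
after the first short sleep instead of a fixed HZ/10 (100 ms), while a
slow device still ends up being polled only at the capped interval.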