On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:

> > Right, so what can we do to make it useful? I think the intent is to
> > limit the number of pages in writeback and provide some progress
> > feedback to the vm.
> >
> > Going by your experience we're failing there.
>
> Well, congestion_wait is a stop sign but not a queue. So, if you're
> being nice and honoring congestion but another process (say O_DIRECT
> random writes) doesn't, then you back off forever and none of your IO
> gets done.
>
> To get around this, you can add code to make sure that you do
> _some_ io, but this isn't enough for your work to get done
> quickly, and you do end up waiting in get_request() so the async
> benefits of using the congestion test go away.
>
> If we changed everyone to honor congestion, we end up with a poll model
> because a ton of congestion_wait() callers create a thundering herd.
>
> So, we could add a queue, and then congestion_wait() would look a lot
> like get_request_wait(). I'd rather that everyone just used
> get_request_wait, and then have us fix any latency problems in the
> elevator.

Except you'd need to lift it to the BDI layer, because not all backing
devices are a block device.

Making it into a per-bdi queue sounds good to me though.

> For me, perfect would be one or more threads per-bdi doing the
> writeback, and never checking for congestion (like what Jens' code
> does). The congestion_wait inside balance_dirty_pages() is really just
> a schedule_timeout(), on a fully loaded box the congestion doesn't go
> away anyway. We should switch that to a saner system of waiting for
> progress on the bdi writeback + dirty thresholds.

Right, one of the things we could possibly do is tie into
__bdi_writeout_inc() and test levels there once every so often and then
flip a bit when we're low enough to stop writing.

> Btrfs would love to be able to send down a bio non-blocking. That would
> let me get rid of the congestion check I have today (I think Jens said
> that would be an easy change and then I talked him into some small mods
> of the writeback path).

Won't that land us into trouble because the amount of writeback will
become unwieldy?

> > > > Now, suppose it were to do something useful, I'd think we'd want to
> > > > limit write-out to whatever it takes so saturate the BDI.
> > >
> > > If we don't want a blanket increase,
> >
> > The thing is, this sysctl seems an utter cop out, we can't even explain
> > how to calculate a number that'll work for a situation, the best we can
> > do is say, prod at it and pray -- that's not good.
> >
> > Last time I also asked if an increased number is good for every
> > situation, I have a machine with a RAID5 array and USB storage, will it
> > harm either situation?
>
> If the goal is to make sure that pdflush or balance_dirty_pages only
> does IO until some condition is met, we should add a flag to the bdi
> that gets set when that condition is met. Things will go a lot more
> smoothly than magic numbers.

Agreed - and from what I can make out, that really is the only goal
here.

> Then we can add the fs_hint as another change so the FS can tell
> write_cache_pages callers how to do optimal IO based on its allocation
> decisions.

I think you lost me here, but I think you mean to provide some FS
specific feedback to the generic write page routines -- whatever
works ;-)
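
To make the "test levels once every so often and flip a bit" idea a
bit more concrete, here is a rough userspace model of it. This is only
a sketch: struct bdi_model, model_writeout_inc(), model_writeback(),
the low watermark and the check interval are all made-up names and
parameters for illustration, not kernel API, and it ignores locking
and per-cpu counters entirely.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct bdi_model {
	unsigned long nr_dirty;       /* pages still dirty against this device */
	unsigned long completions;    /* write-outs completed so far */
	unsigned long low_watermark;  /* assumed "low enough" level */
	unsigned long check_interval; /* test levels every N completions, must be non-zero */
	bool below_threshold;         /* the bit the writer looks at */
};

/* Stand-in for a hook at write-out completion time. */
static void model_writeout_inc(struct bdi_model *bdi)
{
	if (bdi->nr_dirty > 0)
		bdi->nr_dirty--;
	bdi->completions++;

	/* Only test the level once every check_interval completions. */
	if (bdi->completions % bdi->check_interval == 0)
		bdi->below_threshold = bdi->nr_dirty <= bdi->low_watermark;
}

/* Writer: keep issuing IO until the bit is set; returns pages written. */
static unsigned long model_writeback(struct bdi_model *bdi)
{
	unsigned long written = 0;

	while (!bdi->below_threshold && bdi->nr_dirty > 0) {
		model_writeout_inc(bdi); /* pretend one page just completed */
		written++;
	}
	return written;
}
```

With 100 dirty pages, a low watermark of 40 and a check every 8
completions, the writer in this model stops after 64 pages: the level
first drops to 40 at 60 completions, but it is only noticed at the next
check. That overshoot is the price of not testing on every completion.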