On Wed, Oct 14, 2009 at 07:22:28PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:38 +0800, Wu Fengguang wrote:
> 
> > > > Hmm, probably you've discussed this in some other email but why do we
> > > > cycle in this loop until we get below the dirty limit? We used to leave
> > > > the loop after writing write_chunk... So the time we spend in
> > > > balance_dirty_pages() is no longer limited, right?
> > 
> > Right, this is a legitimate concern.
> 
> Quite.
> 
> > > Wu was saying that without the loop nr_writeback wasn't limited, but
> > > since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> > > sure how again that was so.
> > 
> > Let me summarize the ideas :)
> > 
> > There are two cases:
> > 
> > - there is no bdi or block IO queue to limit nr_writeback
> >   This must be fixed. It either lets nr_writeback grow to dirty_thresh
> >   (with the loop) and thus squeeze nr_dirty, or lets it grow out of
> >   control entirely (without the loop). The current state is: the
> >   nr_writeback wait queue for NFS is there; the one for btrfs is still
> >   missing.
> > 
> > - there is a nr_writeback limit, but it is larger than dirty_thresh
> >   In this case nr_dirty will be close to 0 regardless of the loop.
> >   The loop will help to keep
> >       nr_dirty + nr_writeback + nr_unstable < dirty_thresh
> >   Without the loop, the "real" dirty threshold would be larger
> >   (determined by the nr_writeback limit).
> > 
> > > We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> > > permits, but it cannot grow beyond the total limit, since we're actually
> > > waiting for writeout completion.
> > 
> > Yes, this explains the second case. It's a trade-off: the nr_writeback
> > limit cannot be trusted on small memory systems, so we do the loop to
> > impose dirty_thresh, which unfortunately can hurt responsiveness on all
> > systems through prolonged wait times.
> 
> Ok, so I'm still puzzled.

Big sorry - it's me that was confused (by some buggy tests).

> set_page_dirty()
>   balance_dirty_pages_ratelimited()
>     balance_dirty_pages_ratelimited_nr(1)
>       balance_dirty_pages(nr);
> 
> So we call balance_dirty_pages() with an appropriate count for each
> successful set_page_dirty() invocation, right?

Right.

> balance_dirty_pages() guarantees that:
> 
>   nr_dirty + nr_writeback + nr_unstable < dirty_thresh &&
>   (nr_dirty + nr_writeback + nr_unstable <
>                        (dirty_thresh + background_thresh)/2 ||
>    bdi_dirty + bdi_writeback + bdi_unstable < bdi_thresh)
> 
> Now, without the loop and without a writeback limit, I still see no way
> to actually generate more 'dirty' pages than dirty_thresh.
> 
> As soon as we hit dirty_thresh a process will wait for exactly the same
> amount of pages to get cleaned (writeback completed) as were dirtied
> (+/- the ratelimit fuzz, which should even out over processes).

Ah yes, we now wait for writeback to be _completed_ in
bdi_writeback_wait(), instead of merely _started_ as in the old-fashioned
writeback_inodes().

> That should bound things to dirty_thresh -- the wait is on writeback
> complete, so nr_writeback is bounded too.

Right. It was not bounded in the tests because bdi_writeback_wait()
quits _prematurely_: the background writeback finds it is already under
the background threshold, so it wakes up the throttled tasks and then
quits.
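To make the failure mode concrete, here is a rough sketch of that control
flow -- not the actual code, just an illustration assembled from the names
used in this thread: the for_background/over_bground_thresh() check is the
one touched by the hunk below, while the exact shape of wb_writeback() and
the placement of the bdi_writeback_wakeup() call are my assumptions about
where the "wakeup-all" sits.

    /*
     * Illustrative sketch only, not the patch code.
     */
    static long wb_writeback(struct bdi_writeback *wb,
                             struct wb_writeback_args *args)
    {
            long wrote = 0;

            for (;;) {
                    /*
                     * Background writeback bails out as soon as we drop
                     * below the background threshold, regardless of
                     * whether tasks are still throttled in
                     * bdi_writeback_wait().
                     */
                    if (args->for_background && !over_bground_thresh())
                            break;

                    /* ... write back some more inodes, add to wrote ... */
            }

            /*
             * The "wakeup-all": every throttled task returns from
             * bdi_writeback_wait() here, even if fewer than write_chunk
             * pages have completed on its behalf.
             */
            bdi_writeback_wakeup(wb->bdi);

            return wrote;
    }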
Fixed by simply removing the wakeup-all in background writeback, plus this
change:

		if (args->for_background && !over_bground_thresh() &&
+		    list_empty(&wb->bdi->throttle_list))
			break;

So now
- the throttled tasks are guaranteed to be woken up
- they will only be woken up in __bdi_writeout_inc()
- once woken up, at least write_chunk pages have been written on their
  behalf

> [ I forgot the exact semantics of unstable; if we clear writeback before
>   unstable, we need to fix something ]

New tests show that NFS works fine without the loop and without the NFS
nr_writeback limit:

$ dd if=/dev/zero of=/mnt/test/zero3 bs=1M count=200 &
$ vmmon -d 1 nr_writeback nr_dirty nr_unstable

 nr_writeback      nr_dirty   nr_unstable
            0             2             0
            0             2             0
            0         22477            65
            2         20849          1697
            2         19153          3393
            2         17420          5126
        27825             7          5979
        27816             0            41
        26925             0           907
        31064           286           159
        32531             0           213
        32548             0            89
        32405             0           155
        32464             0            98
        32517             0            45
        32560             0           194
        32534             0           220
        32601             0           222
        32490             0            72
        32447             0           115
        32511             0            48
        32535             0           216
        32535             0           216
        32535             0           216
        32535             0           216
        31555             0          1180
        29732             0          3003
        29277             0          3458
        27721             0          5014
        25955             0          6780
        24356             0          8379
        22763             0          9972
        21083             0         11652
        19371             0         13364
        17564             0         15171
        15781             0         16954
        14005             0         18730
        12230             0         20505
        12177             0         20558
        11383             0         21352
         9489             0         23246
         7621             0         25115
         5866             0         26870
         4790             0         27947
         2962             0         29773
         1089             0         31646
            0             0         32735
            0             0         32735
            0             0             0
            0             0             0

> Now, a nr_writeback queue that limits writeback will still be useful,
> esp. for high-speed devices. Once they ramp up and bdi_thresh exceeds the
> queue size, it'll take effect. So you reap the benefits when needed.

Right, the nr_writeback limit avoids

	   nr_writeback => dirty_thresh
	 + nr_dirty + nr_writeback < dirty_thresh
	=> nr_dirty => 0

i.e. nr_writeback creeping up towards dirty_thresh and, given the above
constraint, squeezing nr_dirty down to 0.

Thanks for the clarification, it looks less obscure now :)

Thanks,
Fengguang