* KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky and the biggest reason is that async writes
> > > > are cached in higher layers (page cache) as well as possibly in the file system
> > > > layer (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in a proportional manner.
> > > >
> > > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and a dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > the dirty pages of the same thread are not necessarily picked. It can very well pick
> > > > the inode of the lesser priority dd thread and do some writeout. So effectively
> > > > the higher weight dd is doing writeouts of the lower weight dd's pages and we don't see
> > > > service differentiation.
> > > >
> > > > IOW, the core problem with buffered write fairness is that the higher weight thread
> > > > does not throw enough IO traffic at the IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > > intervals where the higher weight queue is empty and in that duration the lower weight
> > > > queue gets lots of work done, giving the impression that there was no service
> > > > differentiation.
> > > >
> > > > In summary, from IO controller point of view async writes support is there.
> > > > Because page cache has not been designed in such a manner that a higher
> > > > prio/weight writer can do more write out as compared to a lower prio/weight
> > > > writer, getting service differentiation is hard and it is visible in some
> > > > cases and not visible in some cases.
> > >
> > > Here's where it all falls to pieces.
> > >
> > > For async writeback we just don't care about IO priorities. Because
> > > from the point of view of the userspace task, the write was async! It
> > > occurred at memory bandwidth speed.
> > >
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation. And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > >
> > > So when balance_dirty_pages() hits, what do we want to do?
> > >
> > > I suppose that all we can do is to block low-ioprio processes more
> > > aggressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > >
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > >
> >
> > I think I must support dirty-ratio in memcg layer. But not yet.
>

We need to add this to the TODO list.

> OR...I'll add a bufferred-write-cgroup to track buffered writebacks.
> And add a control knob as
>   bufferred_write.nr_dirty_thresh
> to limit the number of dirty pages generated via a cgroup.
>
> Because memcg just records the owner of pages but does not record who makes them
> dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.

>
> But I'm not sure how I should treat I/Os generated out by kswapd.
>

Account them to process 0 :)

--
	Balbir

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
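
For illustration only, here is a minimal userspace sketch of the bufferred_write.nr_dirty_thresh idea from this thread. It is a toy model in Python, not kernel code and not anyone's actual patch. The class BufferedWriteTracker, its method names, and the cgroup names "high" and "low" are all hypothetical. The one point it tries to show is the distinction raised above: a dirty page is charged to the cgroup that dirtied it, and that cgroup is blocked from dirtying more once it reaches its own limit, regardless of which thread later performs the writeout.

```python
from collections import defaultdict


class BufferedWriteTracker:
    """Toy model: count dirty pages per cgroup, charged to the dirtier."""

    def __init__(self, nr_dirty_thresh):
        # nr_dirty_thresh: dict mapping cgroup name -> max dirty pages allowed
        # (stands in for the proposed per-cgroup control knob).
        self.nr_dirty_thresh = dict(nr_dirty_thresh)
        self.nr_dirty = defaultdict(int)
        self.dirtier = {}  # page id -> cgroup that first dirtied it

    def dirty_page(self, cgroup, page):
        """Return True if the page is now dirty, False if the caller must throttle."""
        if page in self.dirtier:
            # Already dirty: the first dirtier stays charged (an assumption).
            return True
        if self.nr_dirty[cgroup] >= self.nr_dirty_thresh[cgroup]:
            # Over the per-cgroup limit: the writer has to wait for writeback.
            return False
        self.dirtier[page] = cgroup
        self.nr_dirty[cgroup] += 1
        return True

    def writeback(self, page):
        """Page reached disk: uncharge it and return the cgroup to bill for the I/O."""
        cgroup = self.dirtier.pop(page)
        self.nr_dirty[cgroup] -= 1
        return cgroup


tracker = BufferedWriteTracker({"high": 4, "low": 2})
tracker.dirty_page("low", 1)
tracker.dirty_page("low", 2)
blocked = not tracker.dirty_page("low", 3)  # "low" is at its limit
billed = tracker.writeback(1)               # billed to "low", whoever wrote it out
```

In this model the I/O for a page is billed to its dirtier even if the writeout is done by another writer or by a reclaim thread. Whether that is the right policy for kswapd-generated I/O is the open question in the thread, and the sketch does not settle it.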