On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote:

[..]
>
> You can get meta data "throttling" and performance at the same time.
> See below ideas.
>
> > >
> > > Either way, you have the freedom to test whether the passed filp is a
> > > normal file or a directory "file", and do conditional throttling.
> >
> > Ok, will look into it. That will probably take care of READS. What
> > about WRITES and meta data. Is it safe to assume that any meta data
> > write will come in some journalling thread context and not in user
> > process context?
>
> It's very possible to throttle meta data READS/WRITES, as long as they
> can be attributed to the original task (assuming task oriented throttling
> instead of bio/request oriented).

Even in bio oriented throttling we attribute the bio to a task and hence
to the group (at least as of today). So from that perspective, it should
not make much difference.

>
> The trick is to separate the concepts of THROTTLING and ACCOUNTING.
> You can ACCOUNT data and meta data reads/writes to the right task, and
> only to THROTTLE the task when it's doing data reads/writes.

Agreed. I too mentioned this idea in one of my mails: account meta data
but do not throttle it, and use that meta data accounting to throttle
data for a longer period of time.

To implement this, I need to know whether an IO is regular IO or meta data
IO, and it looks like one way is for filesystems to mark that info in
the bio for meta data requests.

>
> FYI I played the same trick for balance_dirty_pages_ratelimited() for
> another reason: _accurate_ accounting of dirtied pages.
>
> That trick should play well with most applications who do interleaved
> data and meta data reads/writes. For the special case of "find" who
> does pure meta data reads, we can still throttle it by playing another
> trick: to THROTTLE meta data reads/writes with a much higher threshold
> than that of data.
> So normal applications will almost always be
> throttled at data accesses while "find" will be throttled at meta data
> accesses.

Ok, that makes sense. If an application is doing only lots of meta data
transactions, then try to limit it after some high limit.

I am not very sure if it will run into issues of some file system
dependencies and hence priority inversion.

>
> For a real example of how it works, you can check this patch (plus the
> attached one)

Ok, I will go through the patches for more details.

>
> writeback: IO-less balance_dirty_pages()
> http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556
>
> Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
> is the threshold for THROTTLING. When
>
>         tsk->nr_dirtied > tsk->nr_dirtied_pause
>
> The task will voluntarily enter balance_dirty_pages() for taking a
> nap (pause time will be proportional to tsk->nr_dirtied), and when
> finished, start a new account-and-throttle period by resetting
> tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
> reasonable pause time at next sleep.
>
> BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
>

Actually implementing throttling in balance_dirty_pages() is not hard. I
think it has the following issues.

- It controls the IO rate coming into the page cache but does not control
  the IO rate at the outgoing devices. So a flusher thread can still
  throw lots of writes at a device and completely disrupt read
  latencies.

  If buffered WRITES can disrupt READ latencies unexpectedly, then it
  kind of renders IO controller/throttling useless.

- For application performance, I thought a better mechanism would be
  to come up with a per cgroup dirty ratio. This is equivalent to
  partitioning the page cache and coming up with each cgroup's share.
  Now an application can write to this cache as fast as it wants and is
  throttled only by balance_dirty_pages() rules.

  All this IO must be going to some device, and if an admin has put this
  cgroup in a low bandwidth group, then pages from this cgroup will be
  written slowly and hence tasks in this group will be blocked for a longer
  time.

  If we can make this work, then an application can write to the cache at a
  higher rate and at the same time not create havoc at the end device.

Thanks
Vivek
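
For illustration, the ACCOUNTING vs THROTTLING split discussed in this
thread could be sketched in plain userspace C roughly as below. This is
not the code from the referenced patch: struct task_acct, enum io_kind
and account_io() are made-up names, the two thresholds stand in for
tsk->nr_dirtied_pause (one for data, a much higher one for meta data),
and the "pause" is only returned as a number instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IO classes; the thread only distinguishes data and meta data. */
enum io_kind { IO_DATA, IO_META };

/* Hypothetical per-task state, loosely modelled on tsk->nr_dirtied. */
struct task_acct {
	unsigned long nr_accounted; /* data + meta data pages charged this period */
	unsigned long data_pause;   /* threshold checked on data IO */
	unsigned long meta_pause;   /* much higher threshold checked on meta data IO */
	unsigned long nr_pauses;    /* how many times the task was throttled */
};

/*
 * ACCOUNT every IO to the task, but only THROTTLE when the counter
 * exceeds the threshold for this kind of IO.
 *
 * Returns 0 if the task may continue, otherwise the number of pages the
 * pause should be proportional to. A new account-and-throttle period
 * starts after each pause.
 */
static unsigned long account_io(struct task_acct *t, enum io_kind kind,
				unsigned long pages)
{
	unsigned long limit, owed;

	assert(t != NULL);

	t->nr_accounted += pages;	/* accounting happens for all IO */

	limit = (kind == IO_DATA) ? t->data_pause : t->meta_pause;
	if (t->nr_accounted <= limit)
		return 0;		/* below threshold: no throttling */

	owed = t->nr_accounted;		/* pause proportional to accounted pages */
	t->nr_accounted = 0;		/* reset for the next period */
	t->nr_pauses++;
	return owed;
}
```

With data_pause = 32 and meta_pause = 1024, a task that does 100 pages
of meta data IO is not paused, but its next data IO is, and the pause
covers the meta data it did earlier. A "find"-like task doing only meta
data IO is paused once it crosses the higher threshold.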