On Fri 15-04-11 23:06:02, Vivek Goyal wrote:
> On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> > How about doing throttling at two layers. All the data throttling is
> > done in higher layers and then we also retain the mechanism of
> > throttling at the end device. That way an admin can put an overall
> > limit on such common write traffic. (XFS meta data coming from
> > workqueues, flusher thread, kswapd etc.)
> >
> > Anyway, we can't attribute this IO to a per process context/group,
> > otherwise most likely something will get serialized in higher layers.
> >
> > Right now I am speaking purely from the IO throttling point of view
> > and not even thinking about CFQ and IO tracking stuff.
> >
> > This increases the complexity of the IO cgroup interface as now we
> > seem to have four combinations.
> >
> >   Global throttling
> >     Throttling at lower layers
> >     Throttling at higher layers
> >
> >   Per device throttling
> >     Throttling at lower layers
> >     Throttling at higher layers
>
> Dave,
>
> I wrote the above but I myself am not fond of coming up with 4
> combinations. Want to limit it to two: per device throttling or global
> throttling. Here are some more thoughts in general about both the
> throttling policy and the proportional policy of the IO controller. For
> the throttling policy, I am primarily concerned with how to avoid file
> system serialization issues.
>
> Proportional IO (CFQ)
> ---------------------
> - Make writeback cgroup aware, and kernel threads (flusher) which are
>   cgroup aware can be marked with a task flag (GROUP_AWARE). If a cgroup
>   aware kernel thread throws IO at CFQ, then the IO is accounted to the
>   cgroup of the task which originally dirtied the page. Otherwise we use
>   the task context to account the IO to.
>
>   So any IO submitted by flusher threads will go to the respective
>   cgroups and a higher weight cgroup should be able to do more WRITES.
>
>   IO submitted by other kernel threads like kjournald, XFS async
>   metadata submission, kswapd etc. all goes to the thread context and
>   that is the root group.
>
> - If kswapd is a concern then either make kswapd cgroup aware or let
>   kswapd use a cgroup aware flusher to do the IO (Dave Chinner's idea).
>
> Open Issues
> -----------
> - We do not get isolation for meta data IO. In a virtualized setup, to
>   achieve stronger isolation do not use the host filesystem. Export
>   block devices into the guests.
>
> IO throttling
> -------------
>
> READS
> -----
> - Do not throttle meta data IO. The filesystem needs to mark READ meta
>   data IO so that we can avoid throttling it. This way ordered
>   filesystems will not get serialized behind a throttled read in a slow
>   group.
>
>   Maybe one can account meta data reads to a group and try to use that
>   to throttle data IO in the same cgroup as a compensation.
>
> WRITES
> ------
> - Throttle tasks. Do not throttle bios. That means that when a task
>   submits a direct write, let it go to disk. Do the accounting and if
>   the task is exceeding its IO rate make it sleep. Something similar to
>   balance_dirty_pages().
>
>   That way, any direct WRITES should not run into any serialization
>   issues in ordered mode. We can continue to use the blk_throtl_bio()
>   hook in generic_make_request().
>
> - For buffered WRITES, design a throttling hook similar to
>   balance_dirty_pages() and throttle tasks according to the rules while
>   they are dirtying page cache.
>
> - Do not throttle buffered writes again at the end device as these have
>   been throttled already while writing to the page cache. Also
>   throttling WRITES at the end device will lead to serialization issues
>   with file systems in ordered mode.
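
Just to check I'm reading the GROUP_AWARE part above correctly, the
accounting decision at IO submission time would be something like the
sketch below. This is only an illustration of the rule as I understand it;
PF_GROUP_AWARE, page_blkio_cgroup() and task_blkio_cgroup() are made-up
names for this illustration, not existing kernel interfaces.

/* Sketch only: choose the blkio cgroup an IO gets charged to. */
static struct blkio_cgroup *blkio_cgroup_for_io(struct task_struct *task,
						struct page *page)
{
	/*
	 * Cgroup aware kernel threads (flusher) carry a task flag. For
	 * them, charge the cgroup of the task which originally dirtied
	 * the page.
	 */
	if (task->flags & PF_GROUP_AWARE)
		return page_blkio_cgroup(page);

	/*
	 * Everything else (kjournald, XFS async metadata submission,
	 * kswapd, ...) is charged to the submitter, so kernel threads
	 * end up in the root group.
	 */
	return task_blkio_cgroup(task);
}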
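
Similarly, for the "throttle tasks, do not throttle bios" idea for WRITES,
I imagine something like the following running in the submitting task's
context, in the spirit of balance_dirty_pages(). Again just a sketch, not
the real blk-throttle code: struct io_rate_limit, blkio_charge_and_wait()
and the fixed one second accounting slice are all simplifications invented
for the illustration.

#include <linux/jiffies.h>
#include <linux/math64.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct io_rate_limit {
	spinlock_t lock;		/* protects the fields below */
	u64 bps_limit;			/* allowed bytes per second */
	u64 bytes_charged;		/* bytes charged in current slice */
	unsigned long slice_start;	/* start of the slice, in jiffies */
};

/* Charge a direct WRITE and put the task to sleep if it is over its rate. */
static void blkio_charge_and_wait(struct io_rate_limit *rl, u64 bytes)
{
	unsigned long sleep = 0;

	spin_lock(&rl->lock);
	if (time_after(jiffies, rl->slice_start + HZ)) {
		/* start a new one second accounting slice */
		rl->slice_start = jiffies;
		rl->bytes_charged = 0;
	}
	rl->bytes_charged += bytes;
	if (rl->bps_limit && rl->bytes_charged > rl->bps_limit) {
		u64 excess = rl->bytes_charged - rl->bps_limit;

		/* sleep long enough for the excess to drain at bps_limit */
		sleep = div64_u64(excess * HZ, rl->bps_limit);
	}
	spin_unlock(&rl->lock);

	if (sleep)
		schedule_timeout_killable(sleep);
}

The important property being that the bio itself is never held back at the
device, only the task sleeps after the fact, so an ordered filesystem never
ends up waiting on a throttled bio.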
>
> - The cgroup of an IO is always attributed to the submitting thread.
>   That way all meta data writes will go to the root cgroup and remain
>   unthrottled. If one is too concerned about lots of meta data IO, then
>   one can probably put a throttling rule in the root cgroup.

But I think the above scheme basically allows an aggressive buffered writer
to occupy as much of the disk throughput as throttling at page dirty time
allows. So either you'd have to seriously limit the speed of page dirtying
for each cgroup (effectively giving each write properties like a direct
write) or you'd have to live with a cgroup taking your whole disk
throughput. Neither of which seems very appealing. Grumble, not that I have
a good solution to this problem...

> Open Issues
> -----------
> - IO spikes at end devices
>
>   Because buffered writes are controlled at page dirtying time, we can
>   have a spike of IO later at the end device when the flusher thread
>   decides to do writeback.
>
>   I am not sure how to solve this issue. Part of the problem can be
>   handled by using a per cgroup dirty ratio and keeping each cgroup's
>   ratio low so that we don't build up huge dirty caches. This can lead
>   to a performance drop for applications. So this is a performance vs
>   isolation trade-off and the user chooses one.
>
>   This issue exists in a virtualized environment only if the host file
>   system is used. The best way to achieve maximum isolation would be to
>   export block devices into the guest and then perform throttling per
>   block device.
>
> - Poor isolation for meta data.
>
>   We can't account and throttle meta data in each cgroup, otherwise we
>   would again run into file system serialization issues in ordered mode.
>   So this is a trade-off of using file systems. You primarily get
>   throttling for data IO and not meta data IO.
>
>   Again, export block devices into the virtual machines, create file
>   systems on those, do not use the host filesystem, and one can achieve
>   very good isolation.
>
> Thoughts?

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR