On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> I/O performance is the bottleneck in many systems, from phones to
> servers. Knowing which request to schedule at any moment is crucial to
> systems that support interactive latencies and high throughput. When
> you're watching a video on your desktop, you don't want it to skip
> when you build a kernel.
>
> To address this in our environment Google has now deployed the
> blk-cgroup code worldwide, and I'd like to share some of our
> experiences. We've made modifications for our purposes, and are in the
> process of proposing those upstream:
>
> - Page tracking for buffered writes
> - Fairness-preserving preemption across cgroups

Chad,

This is definitely of interest to me (though I will not be around, I
would like to read the LWN summary of the discussion later. :-)). I
would like to know more about how Google has deployed this and is using
this infrastructure. I would also like to see all the missing pieces
pushed upstream (especially the buffered WRITE support and the page
tracking stuff).

One thing I am curious about is how you get service differentiation
while maintaining high throughput. Idling on a group for fairness is
more or less reasonable on a single SATA disk, but it can very well
kill performance (especially with random IO) on a storage array or on
fast SSDs.

I have been thinking of disabling idling altogether and instead
changing the position of a group in the service tree based on its
weight when new IO comes in (CFQ already does something similar for a
cfqq with the slice_offset() logic). I have been thinking of doing
something similar while calculating the vdisktime of a group when it
gets enqueued. This might give us some service differentiation while
getting better throughput.

You also mentioned controlling latencies very tightly, which probably
means driving shallower queue depths (maybe 1) so that preemption is
somewhat effective and latencies are better. But again, driving a
smaller queue depth can lead to reduced throughput, so I am curious how
you deal with that.

I am also curious to know whether the per-memory-cgroup dirty ratio
stuff got in, and how the issue of selecting which inode to dispatch
writes from, based on the cgroup it belongs to, was dealt with.

>
> There is further work to do along the lines of fine-grained accounting
> and isolation. For example, many file servers in a Google cluster will
> do IO on behalf of hundreds, even thousands of clients. Each client
> has different service requirements, and it's inefficient to map them
> to (cgroup, task) pairs.

So is it ioprio-based isolation or something else?

Thanks
Vivek
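
P.S. To make the vdisktime idea above a bit more concrete, here is a
rough, untested sketch of what I mean. The structs, helpers and
constants (io_group, service_tree, BASE_OFFSET, group_offset()) are
simplified placeholders for illustration only, not the actual
CFQ/blk-cgroup symbols:

/*
 * Sketch: instead of idling on a group to enforce fairness, bias its
 * position on the service tree at enqueue time.  The group's vdisktime
 * is set to the tree's minimum plus an offset that shrinks as the
 * group's weight grows, so heavier groups get dispatched sooner while
 * the disk never sits idle.
 */
#include <stdio.h>

#define BASE_OFFSET    1000u   /* arbitrary service "ticks" of offset */
#define DEFAULT_WEIGHT  500u   /* reference (default) cgroup weight   */

struct io_group {
	unsigned int weight;              /* cgroup IO weight, e.g. 100..1000 */
	unsigned long long vdisktime;
};

struct service_tree {
	unsigned long long min_vdisktime; /* smallest vdisktime currently queued */
};

/*
 * Offset inversely proportional to weight: a group with twice the
 * default weight is enqueued half as far from min_vdisktime, so it is
 * picked earlier, giving differentiation without idling.
 */
static unsigned long long group_offset(struct io_group *grp)
{
	return (unsigned long long)BASE_OFFSET * DEFAULT_WEIGHT / grp->weight;
}

static void enqueue_group(struct service_tree *st, struct io_group *grp)
{
	grp->vdisktime = st->min_vdisktime + group_offset(grp);
	/* real code would now insert grp into the rbtree keyed by vdisktime */
}

int main(void)
{
	struct service_tree st = { .min_vdisktime = 10000 };
	struct io_group heavy = { .weight = 1000 };
	struct io_group light = { .weight = 250 };

	enqueue_group(&st, &heavy);
	enqueue_group(&st, &light);

	printf("heavy group vdisktime: %llu\n", heavy.vdisktime); /* 10500 */
	printf("light group vdisktime: %llu\n", light.vdisktime); /* 12000 */
	return 0;
}

The point is just that the weight ends up encoded in the enqueue
position rather than in idle time, so we keep some service
differentiation without stalling the device waiting for a group to send
more IO.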