On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> I/O performance is the bottleneck in many systems, from phones to
> servers. Knowing which request to schedule at any moment is crucial to
> systems that support interactive latencies and high throughput. When
> you're watching a video on your desktop, you don't want it to skip
> when you build a kernel.
>
> To address this in our environment Google has now deployed the
> blk-cgroup code worldwide, and I'd like to share some of our
> experiences. We've made modifications for our purposes, and are in the
> process of proposing those upstream:
>
> - Page tracking for buffered writes
> - Fairness-preserving preemption across cgroups

Chad,

This is definitely of interest to me (though I will not be around, I
would like to read the LWN summary of the discussion later. :-)). I
would like to know more about how Google has deployed this and is using
this infrastructure. I would also like to see all the missing pieces
pushed upstream (especially the buffered WRITE support and the page
tracking stuff).

One thing I am curious about is how you get service differentiation
while maintaining high throughput. Idling on a group for fairness is
more or less reasonable on a single SATA disk, but it can very well
kill performance (especially with random IO) on a storage array or on
fast SSDs.

I have been thinking of disabling idling altogether and instead
changing the position of a group in the service tree based on its
weight when new IO comes in (CFQ already does something similar for a
cfqq with the slice_offset() logic). I have been thinking of doing
something similar while calculating the vdisktime of a group when it
gets enqueued. This might give us some service differentiation while
getting better throughput.

You also mentioned controlling latencies very tightly, which probably
means driving shallower queue depths (maybe 1) so that preemption is
somewhat effective and latencies are better. But again, driving a
smaller queue depth can lead to reduced throughput, so I am curious how
you deal with that.

I am also curious to know whether the per-memory-cgroup dirty ratio
stuff got in, and how the issue of selecting which inode to dispatch
writes from, based on the cgroup it belongs to, was dealt with.

>
> There is further work to do along the lines of fine-grained accounting
> and isolation. For example, many file servers in a Google cluster will
> do IO on behalf of hundreds, even thousands of clients. Each client
> has different service requirements, and it's inefficient to map them
> to (cgroup, task) pairs.

So is it ioprio-based isolation or something else?

Thanks
Vivek
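
P.S. To make the vdisktime idea above a bit more concrete, here is a
rough, untested sketch of what I mean. The structs, helpers and
constants (io_group, service_tree, BASE_OFFSET, group_offset()) are
simplified placeholders for illustration only, not the actual
CFQ/blk-cgroup symbols:

/*
 * Sketch: instead of idling on a group to enforce fairness, bias its
 * position on the service tree at enqueue time.  The group's vdisktime
 * is set to the tree's minimum plus an offset that shrinks as the
 * group's weight grows, so heavier groups get dispatched sooner while
 * the disk never sits idle.
 */
#include <stdio.h>

#define BASE_OFFSET    1000u   /* arbitrary service "ticks" of offset */
#define DEFAULT_WEIGHT  500u   /* reference (default) cgroup weight   */

struct io_group {
	unsigned int weight;              /* cgroup IO weight, e.g. 100..1000 */
	unsigned long long vdisktime;
};

struct service_tree {
	unsigned long long min_vdisktime; /* smallest vdisktime currently queued */
};

/*
 * Offset inversely proportional to weight: a group with twice the
 * default weight is enqueued half as far from min_vdisktime, so it is
 * picked earlier, giving differentiation without idling.
 */
static unsigned long long group_offset(struct io_group *grp)
{
	return (unsigned long long)BASE_OFFSET * DEFAULT_WEIGHT / grp->weight;
}

static void enqueue_group(struct service_tree *st, struct io_group *grp)
{
	grp->vdisktime = st->min_vdisktime + group_offset(grp);
	/* real code would now insert grp into the rbtree keyed by vdisktime */
}

int main(void)
{
	struct service_tree st = { .min_vdisktime = 10000 };
	struct io_group heavy = { .weight = 1000 };
	struct io_group light = { .weight = 250 };

	enqueue_group(&st, &heavy);
	enqueue_group(&st, &light);

	printf("heavy group vdisktime: %llu\n", heavy.vdisktime); /* 10500 */
	printf("light group vdisktime: %llu\n", light.vdisktime); /* 12000 */
	return 0;
}

The point is just that the weight ends up encoded in the enqueue
position rather than in idle time, so we keep some service
differentiation without stalling the device waiting for a group to send
more IO.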