Re: [LSF/FS TOPIC] I/O performance isolation for shared storage

On Thu, Feb 3, 2011 at 6:31 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> This is definitely of interest to me (though I will not be around, I
> would like to read the LWN summary of the discussion later. :-)). I
> would like to know more about how Google has deployed this and is using
> this infrastructure. I would also like all the missing pieces to be
> pushed upstream (especially the buffered WRITE support and the page
> tracking stuff).

Pushing this all upstream is my current focus, so you'll be getting a
lot of patches in your inbox in the coming weeks.

> One thing I am curious to know is how you deal with getting service
> differentiation while maintaining high throughput. Idling on a group for
> fairness is more or less reasonable on a single SATA disk but can very
> well kill performance (especially with random IO) on a storage array or
> on fast SSDs.

We've sidestepped that problem by deploying the blk-cgroup scheduler
only on single spinning-media drives.  Fast SSDs present a different
set of problems, and it's not clear to me that CFQ is the right
starting point for an SSD scheduler: much of the structure of the code
reflects its design for spinning media.
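
To make "deployed on single spinning drives" a bit more concrete, here
is a minimal sketch of the kind of setup involved, using only the stock
blkio controller interface.  The mount point, group names and weights
are placeholders for illustration, not our production configuration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a single value into a cgroup control file, bailing out on error. */
static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* Placeholder mount point; adjust to wherever blkio is mounted. */
        const char *base = "/sys/fs/cgroup/blkio";
        char path[256], pid[32];

        /* Two groups: "latency" gets 3x the disk time of "batch". */
        snprintf(path, sizeof(path), "%s/latency", base);
        mkdir(path, 0755);
        snprintf(path, sizeof(path), "%s/latency/blkio.weight", base);
        write_str(path, "900");

        snprintf(path, sizeof(path), "%s/batch", base);
        mkdir(path, 0755);
        snprintf(path, sizeof(path), "%s/batch/blkio.weight", base);
        write_str(path, "300");

        /* Put the calling task into the latency group. */
        snprintf(path, sizeof(path), "%s/latency/tasks", base);
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_str(path, pid);

        return 0;
}

The weights only give proportional sharing of disk time on the one
spindle; by themselves they don't bound latency, which is where the
queue-depth and idling discussion below comes in.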

> I have been thinking of disabling idling altogether and instead changing
> the position of a group in the service tree based on its weight when new
> IO comes in (CFQ already does something similar for a cfqq with its
> slice_offset() logic). I have been thinking of doing something similar
> while calculating the vdisktime of a group when it gets enqueued. This
> might give us some service differentiation while getting better
> throughput.

I'd like to hear more about this.  It's not clear to me that idling
would be necessary for throughput on a device with a deep queue.  In
my mind idling is used only to get better throughput by avoiding seeks
introduced when switching between synchronous tasks.
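
If I understand the idea, it would look something like the sketch
below.  This is entirely hypothetical code, not actual cfq-iosched.c:
the constants and helper names are made up for illustration.  A newly
enqueued group is charged a vdisktime offset inversely proportional to
its weight, so heavier groups land nearer the front of the service tree
without anyone idling on them:

/* Both constants are made up for this sketch. */
#define VDISKTIME_BOOST_BASE    1000ULL
#define GROUP_WEIGHT_MAX        1000U

/* Smaller offset for heavier groups, so they are served sooner. */
static u64 group_vdisktime_offset(unsigned int weight)
{
        return div_u64(VDISKTIME_BOOST_BASE * GROUP_WEIGHT_MAX,
                       max(weight, 1U));
}

/* On enqueue, place the group relative to the tree's min_vdisktime
 * instead of idling on it later to enforce its share. */
static void place_cfq_group(struct cfq_group *cfqg, u64 min_vdisktime)
{
        cfqg->vdisktime = min_vdisktime + group_vdisktime_offset(cfqg->weight);
}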

> You also mentioned controlling latencies very tightly, and that probably
> means driving shallower queue depths (maybe 1) so that preemption is
> somewhat effective and latencies are better. But again, driving a smaller
> queue depth can lead to reduced performance. So I am curious how you deal
> with that.

We've currently just made the trade-off that you're pointing out:
we've chosen to limit queue depth and then lean heavily on idling
for sequential, synchronous, well-behaved applications to maintain
throughput.  I think supporting both high throughput and low latency
with many random workloads is still an open area.
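
For reference, the knobs involved are just the stock ones; something
along these lines captures the trade-off (the device name and exact
values are placeholders, not our production settings): clamp the device
queue depth so a high-priority request doesn't sit behind a pile of
already-dispatched IO, and keep CFQ's idling enabled so sequential
synchronous readers still get their throughput.

#include <stdio.h>
#include <stdlib.h>

/* Write one value into a sysfs attribute. */
static void set_tunable(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* Keep the device-level (NCQ/TCQ) queue shallow so preemption
         * actually shortens latency for the next high-priority request. */
        set_tunable("/sys/block/sda/device/queue_depth", "1");

        /* Leave CFQ's per-queue and per-group idling enabled (values in
         * milliseconds; 8 is the stock default) so sequential synchronous
         * readers keep the disk streaming. */
        set_tunable("/sys/block/sda/queue/iosched/slice_idle", "8");
        set_tunable("/sys/block/sda/queue/iosched/group_idle", "8");

        return 0;
}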

> Also curious to know if the per-memory-cgroup dirty ratio stuff got in,
> and how we deal with the issue of selecting which inode to dispatch
> writes from based on the cgroup it belongs to.

We have some experience with per-cgroup writeback under our fake-NUMA
memory container system. Writeback under memcg will likely face
similar issues.  See Greg Thelen's topic description at
http://article.gmane.org/gmane.linux.kernel.mm/58164 for a request for
discussion.

Per-cgroup dirty ratios are just the beginning, as you mention.  Unless
the IO scheduler can see the deep queues of all the blocked tasks, it
can't make the right decisions.  Also, writeback today is ignorant of
each task's debt to the IO scheduler, so it issues the "wrong" inodes.
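
To make the inode-selection problem concrete, here is a purely
hypothetical sketch; none of these helpers exist today, and making them
real depends on the per-page cgroup tracking mentioned above.  The
point is only that the flusher would have to pick inodes dirtied by the
over-limit cgroup rather than whatever sits at the head of the global
dirty list:

/* Hypothetical flusher pass for one over-limit cgroup.  The two
 * inode_*_for() helpers do not exist; they stand in for the page
 * tracking and per-cgroup accounting that still needs to be built. */
static void writeback_inodes_for_cgroup(struct bdi_writeback *wb,
                                        struct mem_cgroup *memcg,
                                        long nr_to_write)
{
        struct inode *inode, *tmp;

        list_for_each_entry_safe(inode, tmp, &wb->b_dirty, i_wb_list) {
                /* Skip inodes with no dirty pages charged to this cgroup. */
                if (!inode_has_dirty_pages_for(inode, memcg))
                        continue;

                nr_to_write -= writeback_single_inode_for(inode, memcg,
                                                          nr_to_write);
                if (nr_to_write <= 0)
                        break;
        }
}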

>> There is further work to do along the lines of fine-grained accounting
>> and isolation. For example, many file servers in a Google cluster will
>> do IO on behalf of hundreds, even thousands of clients. Each client
>> has different service requirements, and it's inefficient to map them
>> to (cgroup, task) pairs.
>
> So is it ioprio-based isolation or something else?

For me that's an open question.  ioprio might be a starting place, but
there is interest in accounting for IO time, and ioprio doesn't
provide a notion of "tagging" IO by submitter.
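
To illustrate what ioprio does and doesn't give us: a task can set a
class/priority pair that all of its subsequent IO inherits, but there
is no way to tag an individual request with "this belongs to client X",
which is what a server proxying thousands of clients really needs.  A
minimal example, using the raw syscall since glibc has no wrapper and
redefining the kernel's constants locally:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

int main(void)
{
        /* Best-effort class, priority 0 (highest of 0-7), for this process. */
        int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0);

        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, getpid(), prio) < 0) {
                perror("ioprio_set");
                return 1;
        }

        /* Every IO this task issues from now on carries that priority;
         * there is no per-request tag identifying the ultimate client. */
        return 0;
}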

Thanks for your interest.

Chad