On Fri, Feb 04, 2011 at 03:07:15PM -0800, Chad Talbott wrote:
> On Thu, Feb 3, 2011 at 6:31 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> >
> > This is definitely of interest to me (though I will not be around, I
> > would like to read the LWN summary of the discussion later. :-)). I
> > would like to know more about how Google has deployed this and is
> > using this infrastructure. I would also like all the missing pieces
> > to be pushed upstream (especially the buffered WRITE support and the
> > page tracking stuff).
>
> Pushing this all upstream is my current focus, so you'll be getting a
> lot of patches in your inbox in the coming weeks.
>
> > One thing I am curious to know is how you deal with getting service
> > differentiation while maintaining high throughput. Idling on a group
> > for fairness is more or less reasonable on a single SATA disk but can
> > very well kill performance (especially with random IO) on a storage
> > array or on fast SSDs.
>
> We've sidestepped that problem by deploying the blk-cgroup scheduler
> against single spinning media drives. Fast SSDs present another set
> of problems. It's not clear to me that CFQ is the right starting
> place for a scheduler for SSDs. Much of the structure of the code
> reflects its design for spinning media drives.
>
> > I have been thinking of disabling idling altogether and trying to
> > change the position of a group in the service tree based on weight
> > when new IO comes in (CFQ already does something similar for cfqq,
> > the slice_offset() logic). I have been thinking of doing something
> > similar while calculating the vdisktime of a group when it gets
> > enqueued. This might give us some service differentiation while
> > getting better throughput.
>
> I'd like to hear more about this.

If a group dispatches some IO and then is empty, it will be deleted
from the service tree, and when new IO comes in, it will be put at the
end of the service tree.
That way all the groups operate more or less round robin and there is
no service differentiation. I was thinking that when a group gets
backlogged, instead of putting it at the end of the service tree, we
come up with a new mechanism where it is put at a certain offset from
st->min_vdisktime. This offset is smaller for a high prio group and
larger for a low prio group. That way, even if a group gets deleted
and comes back again with more IO, there is a chance it gets scheduled
ahead of already queued low prio groups, and we could see some service
differentiation even with idling disabled.

But this is theory at this point, and the efficacy of this approach
will go down as we increase queue depth; service differentiation will
also become nondeterministic. But this might be our best bet on faster
devices with higher queue depths.

> It's not clear to me that idling
> would be necessary for throughput on a device with a deep queue. In
> my mind idling is used only to get better throughput by avoiding seeks
> introduced when switching between synchronous tasks.
>
> > You also mentioned controlling latencies very tightly, and that
> > probably means driving shallower queue depths (maybe 1) so that
> > preemption is somewhat effective and latencies are better. But again,
> > driving a lower queue depth can lead to reduced performance. So I am
> > curious how you deal with that.
>
> We've currently just made the trade-off that you're pointing out.
> We've chosen to limit queue depth and then leaned heavily on idling
> for sequential, synchronous, well-behaved applications to maintain
> throughput. I think supporting high throughput and low-latency with
> many random workloads is still an open area.
>
> > Also curious to know if the per memory cgroup dirty ratio stuff got
> > in, and how we dealt with the issue of selecting which inode to
> > dispatch writes from based on the cgroup it belongs to.
>
> We have some experience with per-cgroup writeback under our fake-NUMA
> memory container system. Writeback under memcg will likely face
> similar issues. See Greg Thelen's topic description at
> http://article.gmane.org/gmane.linux.kernel.mm/58164 for a request for
> discussion.
>
> Per-cgroup dirty ratios are just the beginning, as you mention. Unless
> the IO scheduler can see the deep queues of all the blocked tasks, it
> can't make the right decisions. Also, today writeback is ignorant of
> the tasks' debt to the IO scheduler, so it issues the "wrong" inodes.
>
> > > There is further work to do along the lines of fine-grained
> > > accounting and isolation. For example, many file servers in a
> > > Google cluster will do IO on behalf of hundreds, even thousands of
> > > clients. Each client has different service requirements, and it's
> > > inefficient to map them to (cgroup, task) pairs.
> >
> > So is it ioprio based isolation or something else?
>
> For me that's an open question. ioprio might be a starting place.

[..]

> There is interest in accounting for IO time, and ioprio doesn't
> provide a notion of "tagging" IO by submitter.

I am curious to know how IO time can be accounted for with deep queue
depths: once, say, 32 or more requests are in the driver/device, we
just don't know which request consumed how much of the actual time.

Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html