Re: [PATCH RFC] fsio: filesystem io accounting cgroup

Vivek Goyal <vgoyal@xxxxxxxxxx> · Tue, 9 Jul 2013 10:54:30 -0400

On Tue, Jul 09, 2013 at 07:29:08AM -0700, Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 10:18:33AM -0400, Vivek Goyal wrote:
> > For implementing throttling one as such does not have to do time
> > slice management on the queue.  For providing constructs like IOPS
> > or bandwidth throttling, one just need to put one throttling knob 
> > in the cgroup pipe irrespective of time slice management on the
> > backing device/network.
> 
> We should be providing a comprehensive mechanism to be used from
> userland, not something which serves pieces of specialized
> requirements here and there.  blkio is already a mess with the
> capability changing depending on which elevator is in use and
> blk-throttle counting bios instead of merged requests making iops
> control a bit silly.  We need to clean that up, not add more mess on
> top.

It is not clear whether counting bio or counting request is right
thing to do here. It depends where you are trying to throttle. For
bio based drivers there is request and they need throttling mechanism
too. So keeping it common for both, kind of makes sense.

> 
> > Also time slice management is one way of managing the backend resource.
> > CFQ did that and it works only for slow devices. For faster devices
> > we anyway need some kind of token mechanism instead of keeping track
> > of time.
> 
> No, it is the *right* resource to manage for rotating devices if you
> want any sort of meaningful proportional resource distribution.  It's
> not something one dreams up out of blue but something which arises
> from the fundamental operating characteristics of the device.  For
> SSDs, iops is good enough as their latency profile is consistent
> enough but doing so with rotating disks doesn't yield anything useful.

Ok, so first of all you agree that time slice management is not a
requirement for fast devices.

Secondly, even for slow devices, time slice management practically works
only if NCQ is not implemented in device or NCQ is not being used because
CFQ is not dispatching more requests.

So even in CFQ, time slice accounting works only for sequential IO.
Anybody doing random IO, there is no notion of time slice. We allow
dispatching requests from multiple queues at the same time and then
we don't have a way to count time.

So time slice management is a problem even on slow devices which implement
NCQ. IIRC, in the beginning even CFQ as doing some kind of request
management (and not time slice management). And later it switched to
time slice management in an effort to provide better fairness (If somebody
is doing random IO and seek takes more time the process should be
accounted for it).

But ideal time slice accounting requires driving a queue depth of 1
and for any non-sequential IO, it kills performance.

> 
> > So I don't think trying to manage time slice is the requirement here.
> 
> For a cgroup resource controller, it *is* a frigging requirement to
> control the right fundamental resource at the right place where the
> resource resides and can be fully controlled.  Nobody should have any
> other impression.

Seriously, time slice accounting is one way of managing resource. Same
disk resource can be divided proportionally by counting either iops
or by counting amount of IO done (bandwidth).

If we count iops or bandwidth, it might not be most fair way of doing
things on rotational media but it also should provide more accurate
results in case of NCQ. When multiple requests have been dispatched
to disk we have no idea which request consumed how much of disk time.
So there is no way to account it properly. Iops or bandwidth based
accounting will work just fine even with NCQ.

> 
> > > and by the time you implemented proper hierarchy support and
> > > proportional contnrol, yours isn't gonna be that simple either.
> > 
> > I suspect he is not plannnig to do any proportional control at that
> > layer. Just throttling mechanism.
> 
> blkio should be able to do proportional control in general.  The fact
> that we aren't able to do that except when cfq-iosched is in use is a
> problem which needs to be fixed.  It's not a free-for-all pass for
> creating more broken stuff.

So you want this generic block layer proportional implementation to
do time slice management?

I thought we talked about this implementation to use some kind of token
based mechanism so that it scales better on faster devices. And on slower
devices one will continue to use CFQ.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>