Hi, Andrea,

> >> Ok, I will give more details of the thought process.
> >>
> >> I was thinking of maintaining an rb-tree per request queue and not an
> >> rb-tree per cgroup. This tree can contain all the bios submitted to that
> >> request queue through __make_request(). Every node in the tree will represent
> >> one cgroup and will contain a list of bios issued from the tasks from that
> >> cgroup.
> >>
> >> Every bio entering the request queue through __make_request() function
> >> first will be queued in one of the nodes in this rb-tree, depending on which
> >> cgroup that bio belongs to.
> >>
> >> Once the bios are buffered in rb-tree, we release these to underlying
> >> elevator depending on the proportionate weight of the nodes/cgroups.
> >>
> >> Some more details which I was trying to implement yesterday.
> >>
> >> There will be one bio_cgroup object per cgroup. This object will contain
> >> many bio_group objects. Each bio_group object will be created for each
> >> request queue where a bio from bio_cgroup is queued. Essentially the idea
> >> is that bios belonging to a cgroup can be on various request queues in the
> >> system. So a single object cannot serve the purpose as it cannot be on
> >> many rb-trees at the same time. Hence create one sub object which will keep
> >> track of bios belonging to one cgroup on a particular request queue.
> >>
> >> Each bio_group will contain a list of bios and this bio_group object will
> >> be a node in the rb-tree of request queue. For example, let's say there are
> >> two request queues in the system q1 and q2 (let's say they belong to /dev/sda
> >> and /dev/sdb). Let's say a task t1 in /cgroup/io/test1 is issuing IO both
> >> for /dev/sda and /dev/sdb.
> >>
> >> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
> >> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
> >> and bio_group2 will be in q2's rb-tree.
> >> bio_group1 will contain a list of
> >> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
> >> bios issued by task t1 for /dev/sdb. I thought the same can be extended
> >> for stacked devices also.
> >>
> >> I am still trying to implement it and hopefully this is a doable idea.
> >> I think at the end of the day it will be something very close to dm-ioband
> >> algorithm just that there will be no lvm driver and no notion of separate
> >> dm-ioband device.
> >
> > Vivek, thanks for the detailed explanation. Only a comment. I guess, if
> > we don't change also the per-process optimizations/improvements made by
> > some IO scheduler, I think we can have undesirable behaviours.
> >
> > For example: CFQ uses the per-process iocontext to improve fairness
> > between *all* the processes in a system. But it doesn't have the concept
> > that there's a cgroup context on-top-of the processes.
> >
> > So, some optimizations made to guarantee fairness among processes could
> > conflict with algorithms implemented at the cgroup layer. And
> > potentially lead to undesirable behaviours.
> >
> > For example an issue I'm experiencing with my cgroup-io-throttle
> > patchset is that a cgroup can consistently increase the IO rate (always
> > respecting the max limits), simply by increasing the number of IO worker
> > tasks with respect to another cgroup with a lower number of IO workers. This
> > is probably due to the fact that CFQ tries to give the same amount of
> > "IO time" to all the tasks, without considering that they're organized
> > in cgroups.
>
> BTW this is why I proposed to use a single shared iocontext for all the
> processes running in the same cgroup. Anyway, this is not the best
> solution, because in this way all the IO requests coming from a cgroup
> will be queued to the same cfq queue. If I'm not wrong in this way we
> would implement noop (FIFO) between tasks belonging to the same cgroup
> and CFQ between cgroups.
> But, at least for this particular case, we
> would be able to provide fairness among cgroups.
>
> -Andrea

I once thought the same thing, but this approach breaks compatibility.
I think we should make ionice effective only among the processes in the
same cgroup. The system gives some amount of bandwidth to each of its
cgroups, and the processes in a cgroup fairly share the bandwidth given
to that cgroup. I think this is the straightforward approach.
What do you think?

I think the CFQ-cgroup that the NEC guys are working on, the OpenVZ
team's CFQ scheduler, and dm-ioband with bio-cgroup all work like this.

Thank you,
Hirokazu Takahashi.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
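
To make the difference concrete, here is a minimal sketch (plain Python arithmetic, not kernel code) comparing flat per-task sharing with the two-level sharing described in the reply above. The function names, the cgroup names, the task lists and the weights are all hypothetical and chosen only for illustration. The equal split inside a cgroup stands in for whatever per-process policy would apply there.

```python
def flat_shares(cgroups):
    """Per-task fairness with no notion of cgroups.

    Every task gets an equal slice, so a cgroup's total share grows
    with the number of IO worker tasks it runs.
    """
    total_tasks = sum(len(tasks) for tasks, _weight in cgroups.values())
    return {
        name: len(tasks) / total_tasks
        for name, (tasks, _weight) in cgroups.items()
    }


def two_level_shares(cgroups):
    """Split bandwidth among cgroups by weight, then equally among the
    tasks inside each cgroup. Cgroups with no tasks are skipped."""
    active = {n: (t, w) for n, (t, w) in cgroups.items() if t}
    total_weight = sum(weight for _tasks, weight in active.values())
    per_task = {}
    for name, (tasks, weight) in active.items():
        group_share = weight / total_weight
        for task in tasks:
            per_task[(name, task)] = group_share / len(tasks)
    return per_task


def group_totals(per_task):
    """Sum the per-task shares back up to one number per cgroup."""
    totals = {}
    for (name, _task), share in per_task.items():
        totals[name] = totals.get(name, 0.0) + share
    return totals


# Hypothetical setup: two cgroups with equal weight, but test1 runs
# four IO workers while test2 runs only one.
cgroups = {
    "test1": (["t1", "t2", "t3", "t4"], 1),
    "test2": (["t5"], 1),
}

flat = flat_shares(cgroups)
hier = group_totals(two_level_shares(cgroups))
print("flat per-task sharing:", flat)
print("two-level sharing:    ", hier)
```

With flat sharing, test1 ends up with four fifths of the bandwidth just because it has more workers. With two-level sharing, both cgroups get half, and adding workers only changes how test1's half is divided among its own tasks.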