On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
>> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
>> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > > > There is no context (memcg or otherwise) given to the bdi flusher.
>> > > > > After the bdi flusher checks system-wide background limits, it
>> > > > > uses the over_bg_limit list to find (and rotate) an over-limit
>> > > > > memcg. Using that memcg, the per-memcg, per-bdi dirty inode list
>> > > > > is then walked to find inode pages to write back. Once the memcg
>> > > > > dirty memory usage drops below the memcg threshold, the memcg is
>> > > > > removed from the global over_bg_limit list.
>> > > >
>> > > > If you want controlled hand-off of writeback, you need to pass the
>> > > > memcg that triggered the throttling directly to the bdi. You already
>> > > > know what both the bdi and memcg that need writeback are. Yes, this
>> > > > needs concurrency at the BDI flush level to handle, but see my
>> > > > previous email in this thread for that....
>> > > >
>> > >
>> > > Even with memcg being passed around, I don't think that we get rid of
>> > > the global list lock.
> .....
>> > > The reason being that inodes are not exclusive to
>> > > the memory cgroups. Multiple memory cgroups might be writing to the
>> > > same inode, so the inode still remains in the global list and the
>> > > memory cgroups will, in effect, have a pointer to it.
>> >
>> > So two dirty inode lists that have to be kept in sync? That doesn't
>> > sound particularly appealing. Nor does it scale to an inode being
>> > dirty in multiple cgroups.
>> >
>> > Besides, if you've got multiple memory groups dirtying the same
>> > inode, then you cannot expect isolation between groups. I'd consider
>> > this a broken configuration in this case - how often does this
>> > actually happen, and what is the use case for supporting it?
>> >
>> > Besides, the implications are that we'd have to break up contiguous
>> > IOs in the writeback path simply because two sequential pages are
>> > associated with different groups. That's really nasty, and exactly
>> > the opposite of all the write combining we try to do throughout the
>> > writeback path. Supporting this is also a mess, as we'd have to touch
>> > quite a lot of filesystem code (i.e. .writepage(s) implementations)
>> > to do this.
>>
>> We did not plan on breaking up contiguous IO even if it belonged to
>> different cgroups, for performance reasons. So we probably can live with
>> some inaccuracy and just trigger writeback for one inode, even if that
>> means it could write back the pages of some other cgroups doing IO on
>> that inode.
>
> Which, to me, violates the principle of isolation that, as it's been
> described, this functionality is supposed to provide.
>
> It also means you will have to handle the case of a cgroup over a
> throttle limit with no inodes on its dirty list. It's not a case of
> "probably can live with" the resultant mess; the mess will occur, and
> so handling it needs to be designed in from the start.
>
>> > > So to start writeback on an inode
>> > > you still shall have to take the global lock, IIUC.
>> >
>> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> > in cgroup, and go from there? I mean, really all that cgroup-aware
>> > writeback needs is just adding a new container for managing
>> > dirty inodes in the writeback path and a method for selecting that
>> > container for writeback, right?
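For concreteness, here is a rough sketch of the structure being suggested
above, i.e. bdi -> list of dirty memcgs -> per-memcg list of dirty inodes.
The type and field names below are made up purely for illustration; they
are not existing or proposed kernel code:

/*
 * Illustrative sketch only -- none of these names exist in the kernel.
 * The bdi keeps a list of memcgs that have dirty inodes on it, and each
 * memcg keeps a per-bdi list of the inodes it has dirtied.
 */
struct memcg_bdi_dirty {
	struct mem_cgroup	*memcg;		/* owning memory cgroup */
	struct list_head	dirty_inodes;	/* inodes this memcg dirtied on this bdi */
	struct list_head	bdi_node;	/* entry on the bdi's dirty_memcgs list */
};

struct bdi_memcg_writeback {
	struct list_head	dirty_memcgs;	/* list of struct memcg_bdi_dirty */
	spinlock_t		lock;		/* protects dirty_memcgs */
};

Writeback selection then becomes: pick a dirty memcg off the bdi's list and
walk that memcg's own dirty inode list, rather than walking a single global
per-bdi b_dirty list.
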
>>
>> This was the initial design, where one inode is associated with one cgroup
>> even if processes from multiple cgroups are doing IO to the same inode.
>> Then somebody raised the concern that it is probably too coarse.
>
> Got a pointer?
>
>> IMHO, as a first step, associating an inode with one cgroup exclusively
>> simplifies things considerably, and we can target that first.
>>
>> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
>> makes sense and is a relatively simple way of doing things, at the
>> expense of not being accurate for the shared inode case.
>
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

At the very least, when a task is moved from one cgroup to another, we've
got a shared inode case. This probably won't happen more than once for most
tasks, but it will likely be common.

Curt

> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
>
> e.g. Defining what isolation means, when and if it can be violated,
> how violations are handled, when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path, how memcgs over the dirty threshold with no dirty
> inodes are to be handled, how metadata IO is going to be handled by
> IO controllers, what kswapd is going to do for writeback when the pages
> it's trying to write back during a critical low-memory event belong
> to a cgroup that is throttled at the IO level, etc.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
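For reference, here is a rough sketch of the part of the background flush
loop Greg describes at the top of the thread, i.e. what runs after the
system-wide background limits have been checked: rotate through the
over_bg_limit list, flush each over-limit memcg, and drop it from the list
once it falls back under its threshold. Every helper used here --
over_bg_limit_rotate(), writeback_memcg_inodes(), memcg_dirty_usage(),
memcg_bg_thresh(), over_bg_limit_del() -- is hypothetical, invented only
for illustration:

/* Hypothetical sketch; these helpers do not exist in the kernel. */
static void memcg_background_flush(struct backing_dev_info *bdi)
{
	struct mem_cgroup *memcg;

	/*
	 * Rotate through the global list of memcgs that are over their
	 * background dirty limit so each one gets serviced in turn.
	 */
	while ((memcg = over_bg_limit_rotate()) != NULL) {
		/* Walk this memcg's per-bdi dirty inode list and write
		 * back a batch of its pages. */
		writeback_memcg_inodes(bdi, memcg, MAX_WRITEBACK_PAGES);

		/* Once usage falls back under the memcg's background
		 * threshold, take it off the global list. */
		if (memcg_dirty_usage(memcg) < memcg_bg_thresh(memcg))
			over_bg_limit_del(memcg);
	}
}

Note that, per Dave's points above, a real implementation also has to decide
what to do with a memcg that is over its limit but has no dirty inodes on
this bdi, and how inodes shared between memcgs are accounted; the sketch
glosses over both.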