On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
>> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
>> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > > > There is no context (memcg or otherwise) given to the bdi flusher.
>> > > > > After the bdi flusher checks system-wide background limits, it
>> > > > > uses the over_bg_limit list to find (and rotate) an over-limit
>> > > > > memcg. Using that memcg, the per-memcg, per-bdi dirty inode list
>> > > > > is then walked to find inode pages to write back. Once the memcg
>> > > > > dirty memory usage drops below the memcg threshold, the memcg is
>> > > > > removed from the global over_bg_limit list.
>> > > >
>> > > > If you want controlled hand-off of writeback, you need to pass the
>> > > > memcg that triggered the throttling directly to the bdi. You already
>> > > > know what both the bdi and memcg that need writeback are. Yes, this
>> > > > needs concurrency at the BDI flush level to handle, but see my
>> > > > previous email in this thread for that....
>> > > >
>> > >
>> > > Even with memcg being passed around, I don't think that we get rid of
>> > > the global list lock.
> .....
>> > > The reason being that inodes are not exclusive to
>> > > the memory cgroups. Multiple memory cgroups might be writing to the
>> > > same inode, so the inode still remains in the global list and the
>> > > memory cgroups will, in effect, have a pointer to it.
>> >
>> > So two dirty inode lists that have to be kept in sync? That doesn't
>> > sound particularly appealing. Nor does it scale to an inode being
>> > dirty in multiple cgroups.
>> >
>> > Besides, if you've got multiple memory groups dirtying the same
>> > inode, then you cannot expect isolation between groups. I'd consider
>> > this a broken configuration in this case - how often does this
>> > actually happen, and what is the use case for supporting it?
>> >
>> > Besides, the implications are that we'd have to break up contiguous
>> > IOs in the writeback path simply because two sequential pages are
>> > associated with different groups. That's really nasty, and exactly
>> > the opposite of all the write combining we try to do throughout the
>> > writeback path. Supporting this is also a mess, as we'd have to touch
>> > quite a lot of filesystem code (i.e. .writepage(s) implementations)
>> > to do this.
>>
>> We did not plan on breaking up contiguous IO even if it belonged to
>> different cgroups, for performance reasons. So we probably can live with
>> some inaccuracy and just trigger writeback for one inode, even if that
>> means it could write back the pages of some other cgroups doing IO on
>> that inode.
>
> Which, to me, violates the principle of isolation that, as it's been
> described, this functionality is supposed to provide.
>
> It also means you will have to handle the case of a cgroup over a
> throttle limit with no inodes on its dirty list. It's not a case of
> "probably can live with" the resultant mess; the mess will occur, and
> so handling it needs to be designed in from the start.
>
>> > > So to start writeback on an inode
>> > > you still shall have to take the global lock, IIUC.
>> >
>> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> > in cgroup, and go from there? I mean, really all that cgroup-aware
>> > writeback needs is just adding a new container for managing
>> > dirty inodes in the writeback path and a method for selecting that
>> > container for writeback, right?
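For concreteness, here is a rough sketch of the structure being suggested
above, i.e. bdi -> list of dirty memcgs -> per-memcg list of dirty inodes.
The type and field names below are made up purely for illustration; they
are not existing or proposed kernel code:

/*
 * Illustrative sketch only -- none of these names exist in the kernel.
 * The bdi keeps a list of memcgs that have dirty inodes on it, and each
 * memcg keeps a per-bdi list of the inodes it has dirtied.
 */
struct memcg_bdi_dirty {
	struct mem_cgroup	*memcg;		/* owning memory cgroup */
	struct list_head	dirty_inodes;	/* inodes this memcg dirtied on this bdi */
	struct list_head	bdi_node;	/* entry on the bdi's dirty_memcgs list */
};

struct bdi_memcg_writeback {
	struct list_head	dirty_memcgs;	/* list of struct memcg_bdi_dirty */
	spinlock_t		lock;		/* protects dirty_memcgs */
};

Writeback selection then becomes: pick a dirty memcg off the bdi's list and
walk that memcg's own dirty inode list, rather than walking a single global
per-bdi b_dirty list.
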
>>
>> This was the initial design, where one inode is associated with one cgroup
>> even if processes from multiple cgroups are doing IO to the same inode.
>> Then somebody raised the concern that it is probably too coarse.
>
> Got a pointer?
>
>> IMHO, as a first step, associating an inode with one cgroup exclusively
>> simplifies things considerably, and we can target that first.
>>
>> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
>> makes sense and is a relatively simple way of doing things, at the
>> expense of not being accurate for the shared inode case.
>
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

At the very least, when a task is moved from one cgroup to another, we've
got a shared inode case. This probably won't happen more than once for most
tasks, but it will likely be common.

Curt

> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
>
> e.g. Defining what isolation means, when and if it can be violated,
> how violations are handled, when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path, how memcgs over the dirty threshold with no dirty
> inodes are to be handled, how metadata IO is going to be handled by
> IO controllers, what kswapd is going to do for writeback when the pages
> it's trying to write back during a critical low-memory event belong
> to a cgroup that is throttled at the IO level, etc.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
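For reference, here is a rough sketch of the part of the background flush
loop Greg describes at the top of the thread, i.e. what runs after the
system-wide background limits have been checked: rotate through the
over_bg_limit list, flush each over-limit memcg, and drop it from the list
once it falls back under its threshold. Every helper used here --
over_bg_limit_rotate(), writeback_memcg_inodes(), memcg_dirty_usage(),
memcg_bg_thresh(), over_bg_limit_del() -- is hypothetical, invented only
for illustration:

/* Hypothetical sketch; these helpers do not exist in the kernel. */
static void memcg_background_flush(struct backing_dev_info *bdi)
{
	struct mem_cgroup *memcg;

	/*
	 * Rotate through the global list of memcgs that are over their
	 * background dirty limit so each one gets serviced in turn.
	 */
	while ((memcg = over_bg_limit_rotate()) != NULL) {
		/* Walk this memcg's per-bdi dirty inode list and write
		 * back a batch of its pages. */
		writeback_memcg_inodes(bdi, memcg, MAX_WRITEBACK_PAGES);

		/* Once usage falls back under the memcg's background
		 * threshold, take it off the global list. */
		if (memcg_dirty_usage(memcg) < memcg_bg_thresh(memcg))
			over_bg_limit_del(memcg);
	}
}

Note that, per Dave's points above, a real implementation also has to decide
what to do with a memcg that is over its limit but has no dirty inodes on
this bdi, and how inodes shared between memcgs are accounted; the sketch
glosses over both.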