On Sat, Apr 2, 2011 at 12:33 AM, Greg Thelen <gthelen@xxxxxxxxxx> wrote:
> On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>>> > > There is no context (memcg or otherwise) given to the bdi
>>> > > flusher.  After the bdi flusher checks the system-wide
>>> > > background limits, it uses the over_bg_limit list to find (and
>>> > > rotate) an over-limit memcg.  Using that memcg, the per-memcg
>>> > > per-bdi dirty inode list is walked to find inode pages to write
>>> > > back.  Once the memcg's dirty memory usage drops below the
>>> > > memcg threshold, the memcg is removed from the global
>>> > > over_bg_limit list.
>>> >
>>> > If you want controlled hand-off of writeback, you need to pass the
>>> > memcg that triggered the throttling directly to the bdi.  You
>>> > already know what both the bdi and the memcg that need writeback
>>> > are.  Yes, this needs concurrency at the BDI flush level to
>>> > handle, but see my previous email in this thread for that....
>>>
>>> Even with the memcg being passed around, I don't think we get rid
>>> of the global list lock.
>>
>> You need to - we're getting rid of global lists and locks from
>> writeback for scalability reasons, so any new functionality needs to
>> avoid global locks for the same reason.
>>
>>> The reason is that inodes are not exclusive to memory cgroups.
>>> Multiple memory cgroups might be writing to the same inode.  So the
>>> inode still remains in the global list, and memory cgroups will
>>> have a kind of pointer to it.
>>
>> So two dirty inode lists that have to be kept in sync?  That doesn't
>> sound particularly appealing.  Nor does it scale to an inode being
>> dirty in multiple cgroups.
>>
>> Besides, if you've got multiple memory cgroups dirtying the same
>> inode, then you cannot expect isolation between groups.  I'd
>> consider this a broken configuration in that case - how often does
>> this actually happen, and what is the use case for supporting it?
>>
>> Besides, the implication is that we'd have to break up contiguous
>> IOs in the writeback path simply because two sequential pages are
>> associated with different groups.  That's really nasty, and exactly
>> the opposite of all the write combining we try to do throughout the
>> writeback path.  Supporting this is also a mess, as we'd have to
>> touch quite a lot of filesystem code (i.e. .writepage(s)
>> implementations) to do this.
>>
>>> So to start writeback on an inode you would still have to take the
>>> global lock, IIUC.
>>
>> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> in cgroup, and go from there?  I mean, really all that cgroup-aware
>> writeback needs is just a new container for managing dirty inodes in
>> the writeback path and a method for selecting that container for
>> writeback, right?
>
> I feel compelled to optimize for multiple cgroups concurrently

Correction: I do NOT feel compelled to optimize for sharing...

> dirtying an inode.  I see sharing as legitimate if a file is handed
> off between jobs (cgroups), but I do not see concurrent writing as a
> common use case.  If anyone else feels this is a requirement, please
> speak up.  However, I would like the system to tolerate sharing,
> though it does not have to do so in an optimal fashion.
>
> Here are two approaches that do not optimize for sharing.
> Each approach does, however, try to tolerate sharing without falling
> over.
>
> Approach 1 (inspired by Dave's comments):
>
> bdi -> 1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode
>
> * when setting I_DIRTY in a memcg, insert the inode into
>   bdi_memcg_dirty_inodes rather than b_dirty.
>
> * when clearing I_DIRTY, remove the inode from bdi_memcg_dirty_inodes.
>
> * balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi)
>   if over the bg limit, then queue memcg writeback to the bdi flusher.
>   if over the fg limit, then queue a memcg-waiting description to the
>   bdi flusher (IO-less throttle).
>
> * bdi_flusher(bdi):
>   using (bdi, memcg), write “some” of the bdi_memcg_dirty_inodes
>   list.  “Some” is for fairness.
>
>   if the bdi flusher is unable to bring the memcg's dirty usage below
>   the bg limit once the bdi_memcg_dirty_inodes list is empty, then it
>   needs to do “something” to make forward progress.  This can be
>   caused by either (a) the memcg dirtying multiple bdis, or (b) a
>   freeloading memcg dirtying inodes previously dirtied by another
>   memcg, so that the first dirtying memcg is the one that will write
>   them back.
>
> Case A) If a memcg dirties multiple bdis and then hits its bg limit,
> queue bg writeback for the bdi being written to.  This may not write
> back other useful bdis.  The system-wide background limit has a
> similar issue.  We could link the bdi_memcgs together and wake up
> peer bdis.  For now, defer the problem.
>
> Case B) Dirtying another cgroup’s dirty inode.  While this is not a
> common use case, it could happen.  Options to avoid lockup:
>
> + When an inode becomes dirty-shared, move the inode from the per-bdi
>   per-memcg bdi_memcg_dirty_inodes list to an otherwise unused
>   bdi-wide b_unknown_memcg_dirty (promiscuous inode) list.
>   b_unknown_memcg_dirty is written whenever memcg writeback is
>   invoked against the bdi.  When an inode is cleaned and later
>   redirtied it is added to the normal bdi_memcg_dirty_inodes list.
>
> + Considered: when a file page goes dirty, do not account the dirty
>   page to the memcg where the page was charged; instead recharge the
>   page to the memcg that the inode was billed to (via an inode
>   i_memcg field).  The inode would then need a memcg reference, which
>   would make memcg cleanup tricky.
>
> + Scan the memcg lru for dirty file pages -> associated inodes -> bdi
>   -> writeback(bdi, inode)
>
> + What if memcg dirty limits are simply ignored in case B?
>   Ineffective memcg background writeback would be queued as usage
>   grows.  Once the memcg foreground limit is hit, the task would
>   throttle, waiting for ineffective background writeback that never
>   catches up.  This could wait indefinitely.  One could argue that
>   the hung cgroup deserves this for writing to another cgroup’s
>   inode.  However, the other cgroup could be the troublemaker that
>   sneaks in to dirty the file and assumes dirty ownership before the
>   innocent (now hung) cgroup starts writing.  I am not worried about
>   making this optimal, just about making forward progress.  Fall back
>   to scanning the memcg lru looking for the inodes of dirty pages.
>   This may be expensive, but should only happen with dirty inodes
>   shared between memcgs.
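
To make the shape of approach 1 concrete, here is a rough sketch of the
containers I have in mind.  None of these names (struct bdi_memcg,
bm_dirty_inodes, the helpers) exist today; they are only illustrative
and not meant to apply to any particular tree:

/* one instance per (bdi, memcg) pair that has dirty inodes on this bdi */
struct bdi_memcg {
	struct list_head	bm_node;		/* on the bdi's list of bdi_memcgs */
	struct list_head	bm_dirty_inodes;	/* per-bdi per-memcg dirty inodes */
	struct mem_cgroup	*bm_memcg;
	bool			bm_over_bg_limit;	/* needs background writeback */
};

/*
 * backing_dev_info would grow the 1:N list of the above, e.g.:
 *	struct list_head	memcg_list;
 */

/* __mark_inode_dirty(): I_DIRTY set while the task is in a memcg */
static void bdi_memcg_inode_dirtied(struct bdi_memcg *bm, struct inode *inode)
{
	/* file on the (bdi, memcg) list instead of the bdi-wide b_dirty */
	list_move_tail(&inode->i_wb_list, &bm->bm_dirty_inodes);
}

/* writeback has cleared I_DIRTY */
static void bdi_memcg_inode_cleaned(struct inode *inode)
{
	list_del_init(&inode->i_wb_list);
}

mem_cgroup_balance_dirty_pages(memcg, bdi) would then only have to mark
the matching bdi_memcg over-limit and kick the bdi flusher, and the
flusher walks bm_dirty_inodes until the memcg drops below its
background threshold (or the list goes empty, which is where cases A
and B above come in).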
>
> Approach 2: do something even simpler:
>
> http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359
>
> * __set_page_dirty()
>
>   either set i_memcg=memcg or i_memcg=~0
>   no memcg reference needed; i_memcg is never dereferenced
>
> * mem_cgroup_balance_dirty_pages(memcg, bdi)
>
>   if over the bg limit, then queue the memcg to the bdi for
>   background writeback
>   if over the fg limit, then queue a memcg-waiting description to the
>   bdi flusher (IO-less throttle)
>
> * bdi_flusher(bdi)
>
>   if doing memcg writeback, scan b_dirty filtering with
>   is_memcg_inode(inode, memcg), which checks the i_memcg field and
>   returns true if i_memcg is either ~0 or memcg
>
>   if unable to get the memcg below its dirty memory limit:
>
>   + If the memcg dirties multiple bdis and then hits its bg limit,
>     queue bg writeback for the bdi being written to.  This may not
>     write back other useful bdis.  The system-wide background limit
>     has a similar issue.
>
> - con: If the degree of sharing exceeds the compile-time maximum
>   supported sharing degree (likely 1), then ANY writeback (per-memcg
>   or system-wide) will write back the over-shared inode.  This is
>   undesirable because it punishes innocent cgroups that are not
>   abusively sharing.
>
> - con: the entire b_dirty list has to be scanned, which may involve
>   skipping many inodes not in the over-limit cgroup.  A memcg
>   constantly hitting its limit would monopolize a bdi flusher.
>
> Both approaches are complicated by the (rare) possibility that an
> inode has been claimed (from a dirtying-memcg perspective) by memcg
> M1, but M2 later writes more dirty pages.  When M2 exceeds its dirty
> limit it would be nice to find the inode, even if this requires some
> extra work.
>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@xxxxxxxxxxxxx
>>
>
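P.S. For completeness, a similarly rough sketch of the approach 2
tagging described above.  i_memcg and is_memcg_inode() are placeholders
rather than existing fields/helpers, the id 0 is assumed to mean "not
yet claimed", and ~0 marks a shared inode:

#define I_MEMCG_SHARED	(~0U)	/* inode dirtied by more than one memcg */

/* __set_page_dirty(): remember which memcg is dirtying this inode */
static void inode_set_dirtying_memcg(struct inode *inode,
				     unsigned int memcg_id)
{
	if (inode->i_memcg == 0)
		inode->i_memcg = memcg_id;		/* first owner */
	else if (inode->i_memcg != memcg_id)
		inode->i_memcg = I_MEMCG_SHARED;	/* second memcg joined */
	/* no css reference is taken; i_memcg is never dereferenced */
}

/* bdi flusher: filter b_dirty while writing back on behalf of a memcg */
static bool is_memcg_inode(struct inode *inode, unsigned int memcg_id)
{
	return inode->i_memcg == memcg_id ||
	       inode->i_memcg == I_MEMCG_SHARED;
}

Because I_MEMCG_SHARED is a catch-all, any writeback pass (per-memcg or
system-wide) will pick up a shared inode, which is exactly the
"punishes innocent cgroups" con listed above.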