On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > There is no context (memcg or otherwise) given to the bdi
>> > > flusher.  After the bdi flusher checks system-wide background
>> > > limits, it uses the over_bg_limit list to find (and rotate) an
>> > > over limit memcg.  Using the memcg, then the per memcg per bdi
>> > > dirty inode list is walked to find inode pages to writeback.
>> > > Once the memcg dirty memory usage drops below the memcg-thresh,
>> > > the memcg is removed from the global over_bg_limit list.
>> >
>> > If you want controlled hand-off of writeback, you need to pass the
>> > memcg that triggered the throttling directly to the bdi. You
>> > already know what both the bdi and memcg that need writeback are.
>> > Yes, this needs concurrency at the BDI flush level to handle, but
>> > see my previous email in this thread for that....
>> >
>>
>> Even with memcg being passed around I don't think that we get rid of
>> the global list lock.
>
> You need to - we're getting rid of global lists and locks from
> writeback for scalability reasons, so any new functionality needs to
> avoid global locks for the same reason.
>
>> The reason being that inodes are not exclusive to
>> the memory cgroups. Multiple memory cgroups might be writing to the
>> same inode. So the inode still remains in the global list and memory
>> cgroups kind of will have a pointer to it.
>
> So two dirty inode lists that have to be kept in sync? That doesn't
> sound particularly appealing. Nor does it scale to an inode being
> dirty in multiple cgroups.
>
> Besides, if you've got multiple memory groups dirtying the same
> inode, then you cannot expect isolation between groups.
> I'd consider this a broken configuration in this case - how often
> does this actually happen, and what is the use case for supporting
> it?
>
> Besides, the implications are that we'd have to break up contiguous
> IOs in the writeback path simply because two sequential pages are
> associated with different groups. That's really nasty, and exactly
> the opposite of all the write combining we try to do throughout the
> writeback path. Supporting this is also a mess, as we'd have to
> touch quite a lot of filesystem code (i.e. .writepage(s)
> implementations) to do this.
>
>> So to start writeback on an inode
>> you still shall have to take the global lock, IIUC.
>
> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> in cgroup, and go from there? I mean, really all that cgroup-aware
> writeback needs is just adding a new container for managing
> dirty inodes in the writeback path and a method for selecting that
> container for writeback, right?

I do not feel compelled to optimize for multiple cgroups concurrently
dirtying an inode.  I see sharing as legitimate if a file is handed
off between jobs (cgroups).  But I do not see concurrent writing as a
common use case.  If anyone else feels this is a requirement, please
speak up.  However, I would like the system to tolerate sharing,
though it does not have to do so in optimal fashion.  Here are two
approaches that do not optimize for sharing, though each approach
tries to tolerate sharing without falling over.

Approach 1 (inspired by Dave's comments):

  bdi -> 1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode

* When setting I_DIRTY in a memcg, insert the inode into
  bdi_memcg_dirty_inodes rather than b_dirty.

* When clearing I_DIRTY, remove the inode from
  bdi_memcg_dirty_inodes.

* balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi)
  If over the bg limit, then queue memcg writeback to the bdi
  flusher.
  If over the fg limit, then queue a memcg-waiting description to the
  bdi flusher (IO-less throttle).
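For concreteness, the bdi -> bdi_memcg -> dirty-inode containment
above could be sketched as follows.  This is a userspace sketch with
made-up stub types (inode_stub, bdi_stub, mark_inode_dirty, etc.);
none of these are the real kernel structures or interfaces, and real
code would need locking that is omitted here.

```c
#include <stdlib.h>

/* Illustrative stand-ins for struct inode / struct backing_dev_info. */
struct inode_stub {
    int ino;
    struct inode_stub *next;    /* next on its bdi_memcg dirty list */
};

/* Per-(bdi, memcg) container for dirty inodes. */
struct bdi_memcg {
    int memcg_id;
    struct inode_stub *dirty_inodes;  /* bdi_memcg_dirty_inodes list */
    struct bdi_memcg *next;           /* sibling under the same bdi */
};

struct bdi_stub {
    struct bdi_memcg *memcgs;         /* 1:N  bdi -> bdi_memcg */
};

/* Find or create the bdi_memcg for this memcg. */
static struct bdi_memcg *bdi_memcg_lookup(struct bdi_stub *bdi,
                                          int memcg_id)
{
    struct bdi_memcg *m;

    for (m = bdi->memcgs; m; m = m->next)
        if (m->memcg_id == memcg_id)
            return m;
    m = calloc(1, sizeof(*m));
    m->memcg_id = memcg_id;
    m->next = bdi->memcgs;
    bdi->memcgs = m;
    return m;
}

/* On I_DIRTY: insert into the per-memcg list instead of b_dirty. */
static void mark_inode_dirty(struct bdi_stub *bdi, int memcg_id,
                             struct inode_stub *inode)
{
    struct bdi_memcg *m = bdi_memcg_lookup(bdi, memcg_id);

    inode->next = m->dirty_inodes;
    m->dirty_inodes = inode;
}

/* Flusher side: count dirty inodes one memcg owns on this bdi. */
static int count_dirty(struct bdi_stub *bdi, int memcg_id)
{
    struct bdi_memcg *m;
    struct inode_stub *i;
    int n = 0;

    for (m = bdi->memcgs; m; m = m->next)
        if (m->memcg_id == memcg_id)
            for (i = m->dirty_inodes; i; i = i->next)
                n++;
    return n;
}
```

The point of the shape is that per-memcg background writeback only
ever walks its own bdi_memcg list, so no global dirty list or lock is
needed on that path.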
* bdi_flusher(bdi):
  Using bdi and memcg, write "some" of the bdi_memcg_dirty_inodes
  list.  "Some" is for fairness.

  If the bdi flusher is unable to bring memcg dirty usage below the
  bg limit after the bdi_memcg_dirty_inodes list is empty, then it
  needs to do "something" to make forward progress.  This could be
  caused by either (a) the memcg dirtying multiple bdi, or (b) a
  freeloading memcg dirtying inodes previously dirtied by another
  memcg, so the first dirtying memcg is the one that will write them
  back.

  Case A) If a memcg dirties multiple bdi and then hits its memcg bg
  limit, queue bg writeback for the bdi being written to.  This may
  not write back other useful bdi.  The system-wide background limit
  has a similar issue.  Could link bdi_memcg together and wake up
  peer bdi.  For now, defer the problem.

  Case B) Dirtying another cgroup's dirty inode.  While this is not
  a common use case, it could happen.  Options to avoid lockup:

  + When an inode becomes dirty-shared, move the inode from the
    per-bdi per-memcg bdi_memcg_dirty_inode list to an otherwise
    unused bdi-wide b_unknown_memcg_dirty (promiscuous inode) list.
    b_unknown_memcg_dirty is written whenever memcg writeback is
    invoked on the bdi.  When an inode is cleaned and later redirtied
    it is added to the normal bdi_memcg_dirty_inode list.

  + Considered: when a file page goes dirty, do not account the
    dirty page to the memcg where the page was charged; instead
    recharge the page to the memcg that the inode was billed to (via
    an inode i_memcg field).  The inode would require a memcg
    reference, which would make memcg cleanup tricky.

  + Scan the memcg lru for dirty file pages -> associated inodes ->
    bdi -> writeback(bdi, inode).

  + What if memcg dirty limits are simply ignored in case B?
    Ineffective memcg background writeback would be queued as usage
    grows.  Once the memcg foreground limit is hit, it would
    throttle, waiting for the ineffective background writeback to
    catch up, which it never will.  This could wait indefinitely.
    Could argue that the hung cgroup deserves this for writing to
    another cgroup's inode.  However, the other cgroup could be the
    troublemaker who sneaks in to dirty the file and assume dirty
    ownership before the innocent (now hung) cgroup starts writing.

    I am not worried about making this optimal, just about making
    forward progress.  Fall back to scanning the memcg lru looking
    for the inodes of dirty pages.  This may be expensive, but
    should only happen with dirty inodes shared between memcg.

Approach 2: do something even simpler:
http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359

* __set_page_dirty() either sets i_memcg=memcg or i_memcg=~0.
  No memcg reference is needed; i_memcg is never dereferenced.

* mem_cgroup_balance_dirty_pages(memcg, bdi)
  If over the bg limit, then queue the memcg to the bdi for
  background writeback.
  If over the fg limit, then queue a memcg-waiting description to
  the bdi flusher (IO-less throttle).

* bdi_flusher(bdi)
  If doing memcg writeback, scan b_dirty filtering with
  is_memcg_inode(inode, memcg), which checks the i_memcg field:
  return i_memcg in [~0, memcg].

  If unable to get the memcg below its dirty memory limit:
  + If the memcg dirties multiple bdi and then hits its bg limit,
    queue bg writeback for the bdi being written to.  This may not
    write back other useful bdi.  The system-wide background limit
    has a similar issue.

- con: If the degree of sharing exceeds the compile-time max
  supported sharing degree (likely 1), then ANY writeback
  (per-memcg or system-wide) will write back the over-shared inode.
  This is undesirable because it punishes innocent cgroups that are
  not abusively sharing.

- con: Have to scan the entire b_dirty list, which may involve
  skipping many inodes not in the over-limit cgroup.  A memcg
  constantly hitting its limit would monopolize a bdi flusher.

Both approaches are complicated by the (rare) possibility of an
inode that has been claimed (from a dirtying-memcg perspective) by
memcg M1, but to which M2 later writes more dirty pages.
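Under Approach 2, that M1-then-M2 hand-off is exactly what the ~0
wildcard absorbs.  A minimal userspace sketch of the claim-and-filter
logic (all names here - inode_stub, I_MEMCG_SHARED,
account_page_dirtied - are illustrative, not the proposed kernel
interfaces, and 0 stands in for "no owner yet"):

```c
#include <stdint.h>

/* i_memcg as described above: the dirtying memcg's id, or ~0 once
 * more than one memcg has dirtied the inode.  It is an id only and
 * is never dereferenced, so no memcg reference is held. */
#define I_MEMCG_SHARED (~(uint32_t)0)

struct inode_stub {
    uint32_t i_memcg;   /* 0 = clean / unclaimed (sketch convention) */
};

/* __set_page_dirty() side: the first dirtier claims the inode; a
 * second, different dirtier demotes it to the shared wildcard. */
static void account_page_dirtied(struct inode_stub *inode,
                                 uint32_t memcg_id)
{
    if (inode->i_memcg == 0)                /* unclaimed: claim it */
        inode->i_memcg = memcg_id;
    else if (inode->i_memcg != memcg_id)    /* contended: wildcard */
        inode->i_memcg = I_MEMCG_SHARED;
}

/* Flusher filter while scanning b_dirty for per-memcg writeback:
 * "return i_memcg in [~0, memcg]". */
static int is_memcg_inode(const struct inode_stub *inode,
                          uint32_t memcg_id)
{
    return inode->i_memcg == memcg_id ||
           inode->i_memcg == I_MEMCG_SHARED;
}
```

Once demoted, the inode matches every memcg's writeback pass, which
is the first "con" above: any over-limit cgroup ends up writing the
over-shared inode.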
When M2 exceeds its dirty limit it would be nice to find the inode,
even if this requires some extra work.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx