Re: [Lsf] IO less throttling and cgroup aware writeback

On Wed, Apr 06, 2011 at 04:07:14PM -0700, Greg Thelen wrote:
> Vivek Goyal <vgoyal@xxxxxxxxxx> writes:
> 
> > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
> >
> > [..]
> >> > Can someone describe a valid shared inode use case? If not, we
> >> > should not even consider it as a requirement and explicitly document
> >> > it as a "not supported" use case.
> >> 
> >> At the very least, when a task is moved from one cgroup to another,
> >> we've got a shared inode case.  This probably won't happen more than
> >> once for most tasks, but it will likely be common.
> >
> > I am hoping that for such cases sooner or later inode movement will
> > automatically take place. At some point of time, inode will be clean
> > and no more on memcg_bdi list. And when it is dirtied again, I am 
> > hoping it will be queued on the new group's list and not on the old
> > group's
> > list? Greg?
> >
> > Thanks
> > Vivek
> 
> After more thought, a few tweaks to the previous design have emerged.  I
> noted such differences with 'Clarification' below.
> 
> When an inode is marked dirty, current->memcg is used to determine
> which per memcg b_dirty list within the bdi is used to queue the
> inode.  When the inode is marked clean, then the inode is removed from
> the per memcg b_dirty list.  So, as Vivek said, when a process is
> migrated between memcg, then the previously dirtied inodes will not be
> moved.  Once such inodes are marked clean and then re-dirtied,
> they will be requeued to the correct per memcg dirty inode list.
> 
> Here's an overview of the approach, which assumes inode sharing is
> rare but possible.  Thus, such sharing is tolerated (no live locks,
> etc) but not optimized.
> 
> bdi -> 1:N -> bdi_memcg -> 1:N -> inode
> 
> mark_inode_dirty(inode)
>    If I_DIRTY is clear, set I_DIRTY and insert the inode into bdi_memcg->b_dirty
>    using current->memcg as a key to select the correct list.
>        This will require memory allocation of a bdi_memcg, if this is the
>        first inode within the (bdi, memcg) pair.  If the allocation fails
>        (rare, but possible), then fall back to adding the inode to the
>        root cgroup's dirty inode list.
>    If I_DIRTY is already set, then do nothing.

This is where it gets tricky. Page cache dirtiness is tracked via
I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
I_DIRTY_SYNC are for inode metadata changes, and a lot of
filesystems track those themselves. Indeed, XFS doesn't mark inodes
dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
more, and there's no way that tracking can be made cgroup aware.

Hence it can be the case that only I_DIRTY_PAGES is tracked in
the VFS dirty lists, and that is the flag you need to care about
here.

Further, we are actually looking at formalising this - changing the
.dirty_inode() operation to take the dirty flags and return a result
that indicates whether the inode should be tracked in the VFS dirty
list at all. This would stop double tracking of dirty inodes and go
a long way to solving some of the behavioural issues we have now
(e.g. the VFS tracking and trying to writeback inodes that the
filesystem has already cleaned).

Hence I think you need to be explicit that this tracking is
specifically for I_DIRTY_PAGES state, though will handle other dirty
inode states if desired by the filesystem.




> When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> if the list is now empty.
> 
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
>    if over bg limit, then
>        set bdi_memcg->b_over_limit
>            If there is no bdi_memcg (because all inodes of current's
>            memcg dirty pages were first dirtied by another memcg) then
>            use the memcg lru to find an inode and call
>            writeback_single_inode().  This is to handle uncommon sharing.

We don't want to introduce any new IO sources into
balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
flusher writeback, not try to write back inodes itself.

Alternatively, this problem won't exist if you transfer page cache
state from one memcg to another when you move the inode from one
memcg to another.

>        reference memcg for bdi flusher
>        awake bdi flusher
>    if over fg limit
>        IO-full: write bdi_memcg->b_dirty inodes directly (if empty, use
>        the memcg lru to find an inode to write)
> 
>        Clarification: In IO-less: queue memcg-waiting description to bdi
>        flusher waiters (balance_list).

I'd be looking at designing for IO-less throttling up front....

> Clarification:
> wakeup_flusher_threads():
>   would take an optional memcg parameter, which would be included in the
>   created work item.
> 
>   try_to_free_pages() would pass in a memcg.  Other callers would pass
>   in NULL.
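> 
>   The memcg-carrying work item could be sketched roughly as follows (a
>   toy userspace model; struct and field names are illustrative, and -1
>   stands in for the NULL "no memcg" case):
> 
>   ```c
>   #include <assert.h>
>   #include <stddef.h>
> 
>   struct bdi_memcg {
>       int memcg_id;
>       int nr_dirty;               /* dirty inodes on this memcg's b_dirty */
>   };
> 
>   struct wb_work {
>       int memcg_id;               /* -1: no memcg given, flush everything */
>   };
> 
>   /* Returns the number of inodes "written back" for this work item. */
>   static int process_work(struct bdi_memcg *list, size_t n,
>                           struct wb_work *work)
>   {
>       int written = 0;
>       size_t i;
> 
>       for (i = 0; i < n; i++) {
>           if (work->memcg_id != -1 && list[i].memcg_id != work->memcg_id)
>               continue;           /* work targets a single bdi_memcg */
>           written += list[i].nr_dirty;
>           list[i].nr_dirty = 0;
>       }
>       return written;
>   }
> 
>   int main(void)
>   {
>       struct bdi_memcg bdi[] = { { 1, 3 }, { 2, 5 } };
>       struct wb_work targeted = { .memcg_id = 2 };  /* try_to_free_pages */
>       struct wb_work global = { .memcg_id = -1 };   /* other callers */
> 
>       assert(process_work(bdi, 2, &targeted) == 5); /* only memcg 2 */
>       assert(process_work(bdi, 2, &global) == 3);   /* remaining memcg 1 */
>       return 0;
>   }
>   ```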
> 
> 
> bdi_flusher(bdi):
>     Clarification: When processing the bdi work queue, some work items
>     may include a memcg (see wakeup_flusher_threads above).  If present,
>     use the specified memcg to determine which bdi_memcg (and thus
>     b_dirty list) should be used.  If NULL, then all bdi_memcg would be
>     considered to process all inodes within the bdi.
> 
>    once work queue is empty:
>        wb_check_old_data_flush():
>            write old inodes from each of the per-memcg dirty lists.
> 
>        wb_check_background_flush():
>            if any of bdi_memcg->b_over_limit is set, then write
>            bdi_memcg->b_dirty inodes until under limit.
> 
>                After writing some data, recheck to see if memcg is still over
>                bg_thresh.  If under limit, then clear b_over_limit and release
>                memcg reference.
> 
>                If unable to bring memcg dirty usage below bg limit after
>                bdi_memcg->b_dirty is empty, release memcg reference and return.
>                Next time memcg calls balance_dirty_pages it will either select
>                another bdi or use lru to find an inode.

I think all the background flush cares about is bringing memcg's
under the dirty limit. What balance_dirty_pages() does is irrelevant
to the background flush.

>            use over_bground_thresh() to check global background limit.

the background flush needs to continue while over the global limit
even if all the memcg's are under their limits. In which case, we
need to consider if we need to be fair when writing back memcg's on
a bdi i.e. do we cycle an inode at a time until b_io is empty, then
cycle to the next memcg, and not come back to the first memcg with
inodes queued on b_more_io until they all have empty b_io queues?

> When a memcg is deleted it may leave behind bdi_memcg structures.  These memcg
> pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
> list will become empty and the bdi_memcg will be deleted.

So you need to reference count the bdi_memcg structures?

> Too much code churn in writeback is not good.  So these memcg writeback
> enhancements should probably wait for IO-less dirty throttling to get
> worked out.

Agreed. We're probably looking at .41 or .42 for any memcg writeback
enhancements.

> These memcg messages are design level discussions to get me
> heading the right direction.  I plan on implementing memcg aware
> writeback in the background while IO-less balance_dirty_pages is worked
> out so I can follow it up.

Great!

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx