Re: [Lsf] IO less throttling and cgroup aware writeback

On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:

[..]
> > mark_inode_dirty(inode)
> >    If I_DIRTY is clear, set I_DIRTY and insert the inode into bdi_memcg->b_dirty
> >    using current->memcg as a key to select the correct list.
> >        This will require memory allocation of bdi_memcg, if this is the
> >        first inode within the bdi,memcg.  If the allocation fails (rare,
> >        but possible), then fallback to adding the memcg to the root
> >        cgroup dirty inode list.
> >    If I_DIRTY is already set, then do nothing.
> 
> This is where it gets tricky. Page cache dirtiness is tracked via
> I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
> I_DIRTY_SYNC are for inode metadata changes, and a lot of
> filesystems track those themselves. Indeed, XFS doesn't mark inodes
> dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
> more, and there's no way that tracking can be made cgroup aware.
> 
> Hence it can be the case that only I_DIRTY_PAGES is tracked in
> the VFS dirty lists, and that is the flag you need to care about
> here.
> 
> Further, we are actually looking at formalising this - changing the
> .dirty_inode() operation to take the dirty flags and return a result
> that indicates whether the inode should be tracked in the VFS dirty
> list at all. This would stop double tracking of dirty inodes and go
> a long way to solving some of the behavioural issues we have now
> (e.g. the VFS tracking and trying to writeback inodes that the
> filesystem has already cleaned).
> 
> Hence I think you need to be explicit that this tracking is
> specifically for I_DIRTY_PAGES state, though will handle other dirty
> inode states if desired by the filesystem.

Ok, that makes sense. We are primarily interested in the I_DIRTY_PAGES
state.

IIUC, we first need to fix the existing code, where we seem to be moving
any inode onto the bdi writeback list based on the I_DIRTY flag.
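Something along these lines on the __mark_inode_dirty() path, i.e. keying
the per-memcg tracking on I_DIRTY_PAGES only (just a sketch; struct
bdi_memcg, bdi_memcg_find_or_create() and the per-memcg b_dirty list are
the structures proposed in this thread, not existing code):

/*
 * Sketch only: struct bdi_memcg, bdi_memcg_find_or_create() and the
 * per-memcg b_dirty list are the proposed structures from this
 * thread, not existing kernel code.
 */
static void memcg_track_dirty_inode(struct inode *inode, int flags)
{
	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
	struct bdi_memcg *bm;

	/* Only data page dirtying is tracked per memcg; pure metadata
	 * dirtying (I_DIRTY_SYNC/I_DIRTY_DATASYNC) may be tracked by
	 * the filesystem itself, as Dave points out. */
	if (!(flags & I_DIRTY_PAGES))
		return;

	/* keyed on current's memcg, as in the proposal above; can fail */
	bm = bdi_memcg_find_or_create(bdi, current);
	if (!bm)
		bm = bdi->root_bdi_memcg;  /* fall back to root cgroup list */

	/* under the appropriate writeback list lock */
	list_move(&inode->i_wb_list, &bm->b_dirty);
}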

BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To
me both seem to mean that data needs to be written back and not the
inode itself.

> 
> > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> > if the list is now empty.
> > 
> > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >    if over bg limit, then
> >        set bdi_memcg->b_over_limit
> >            If there is no bdi_memcg (because all inodes holding the
> >            current memcg's dirty pages were first dirtied by another
> >            memcg), then use the memcg lru to find an inode and call
> >            writeback_single_inode().  This is to handle uncommon
> >            sharing.
> 
> We don't want to introduce any new IO sources into
> balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> flusher writeback, not try to write back inodes itself.

Won't we get more sequential IO traffic once we find an inode by
traversing the memcg LRU list? So isn't that better than pure LRU based
flushing?
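To make sure we are talking about the same thing, for the uncommon-sharing
fallback I was imagining something like this (mem_cgroup_pick_dirty_inode()
and bdi_queue_inode_work() are made-up names for illustration, not existing
kernel functions):

/*
 * Illustrative only: mem_cgroup_pick_dirty_inode() and
 * bdi_queue_inode_work() are made-up names for the discussion.
 */
static void handle_uncommon_sharing(struct mem_cgroup *memcg,
				    struct backing_dev_info *bdi)
{
	struct inode *inode;

	/* Walk the memcg's page LRU until we find a dirty page and take
	 * its inode: writing that inode out as a whole should give us
	 * reasonably sequential IO. */
	inode = mem_cgroup_pick_dirty_inode(memcg, bdi);
	if (!inode)
		return;

	/* Hand the inode to the bdi flusher rather than issuing IO from
	 * balance_dirty_pages() directly. */
	bdi_queue_inode_work(bdi, inode);
}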

> 
> Alternatively, this problem won't exist if you transfer page cache
> state from one memcg to another when you move the inode from one
> memcg to another.

But in the case of a shared inode the problem still remains. The inode is
being written from two cgroups, and it can't be on both groups' lists as
per the existing design.

> 
> >        reference memcg for bdi flusher
> >        awake bdi flusher
> >    if over fg limit
> >        IO-full: write bdi_memcg directly (if empty use memcg lru to find
> >        inode to write)
> > 
> >        Clarification: In IO-less: queue memcg-waiting description to bdi
> >        flusher waiters (balance_list).
> 
> I'd be looking at designing for IO-less throttling up front....

Agreed. Let's design it on top of the IO-less throttling patches. We
shall also have to modify IO-less throttling a bit so that page
completions are not uniformly distributed across all the throttled
threads; instead we need to account for groups first and then distribute
completions uniformly within each group.
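Something like a two-level split of the completions (all the names below
are made up for illustration; the real IO-less throttling patches have
their own structures):

/*
 * Illustration only: bdi_memcg, memcg_list, the weight fields and
 * wake_group_waiters() are made-up names, not the structures from the
 * IO-less throttling patches.
 */
static void distribute_page_completions(struct backing_dev_info *bdi,
					unsigned long nr_completed)
{
	struct bdi_memcg *bm;
	unsigned long share;

	/* First split the completed pages across the groups throttled
	 * on this bdi, in proportion to each group's share... */
	list_for_each_entry(bm, &bdi->memcg_list, memcg_node) {
		share = nr_completed * bm->weight / bdi->total_weight;

		/* ...then hand each group's share to its waiters in FIFO
		 * order, the way the IO-less code distributes completions
		 * across the whole bdi today. */
		wake_group_waiters(bm, share);
	}
}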

[..]
> >            use over_bground_thresh() to check global background limit.
> 
> the background flush needs to continue while over the global limit
> even if all the memcg's are under their limits. In which case, we
> need to consider if we need to be fair when writing back memcg's on
> a bdi i.e. do we cycle an inode at a time until b_io is empty, then
> cycle to the next memcg, and not come back to the first memcg with
> inodes queued on b_more_io until they all have empty b_io queues?
> 

I think continuing to cycle through the memcg's even in this case will
make sense.
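Roughly like this (over_bground_thresh() is the existing global check; the
per-memcg b_io lists and writeback_memcg_chunk() are illustrative names
only):

/*
 * Round-robin sketch: while we are over the global background
 * threshold, write a chunk from each memcg's b_io list in turn so
 * that no single group is starved.
 */
static long background_writeback_round(struct backing_dev_info *bdi)
{
	struct bdi_memcg *bm;
	long wrote = 0;

	list_for_each_entry(bm, &bdi->memcg_list, memcg_node) {
		if (list_empty(&bm->b_io))
			continue;

		/* write up to MAX_WRITEBACK_PAGES from this group,
		 * then move on to the next one */
		wrote += writeback_memcg_chunk(bm, MAX_WRITEBACK_PAGES);

		if (!over_bground_thresh())
			break;
	}
	return wrote;
}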

Thanks
Vivek

