Re: [Lsf] IO less throttling and cgroup aware writeback


 



Vivek Goyal <vgoyal@xxxxxxxxxx> writes:

> On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
>
> [..]
>> > Can someone describe a valid shared inode use case? If not, we
>> > should not even consider it as a requirement and explicitly document
>> > it as a "not supported" use case.
>> 
>> At the very least, when a task is moved from one cgroup to another,
>> we've got a shared inode case.  This probably won't happen more than
>> once for most tasks, but it will likely be common.
>
> I am hoping that for such cases, sooner or later, inode movement will
> take place automatically. At some point the inode will be clean and no
> longer on the memcg_bdi list. And when it is dirtied again, I am
> hoping it will be queued on the new group's list and not on the old
> group's list? Greg?
>
> Thanks
> Vivek

After more thought, a few tweaks to the previous design have emerged.  I
noted such differences with 'Clarification' below.

When an inode is marked dirty, current->memcg is used to determine
which per memcg b_dirty list within the bdi is used to queue the
inode.  When the inode is marked clean, then the inode is removed from
the per memcg b_dirty list.  So, as Vivek said, when a process is
migrated between memcgs, the previously dirtied inodes will not be
moved.  Once such inodes are marked clean and then re-dirtied, they
will be requeued to the correct per memcg dirty inode list.

Here's an overview of the approach, which assumes inode sharing is
rare but possible.  Thus, such sharing is tolerated (no live locks,
etc.) but not optimized.

bdi -> 1:N -> bdi_memcg -> 1:N -> inode

mark_inode_dirty(inode)
   If I_DIRTY is clear, set I_DIRTY and insert the inode into
   bdi_memcg->b_dirty, using current->memcg as a key to select the
   correct list.
       This will require memory allocation of a bdi_memcg, if this is
       the first inode within the bdi,memcg pair.  If the allocation
       fails (rare, but possible), then fall back to adding the inode
       to the root cgroup dirty inode list.
   If I_DIRTY is already set, then do nothing.

When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
if the list is now empty.

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
   if over bg limit, then
       set bdi_memcg->b_over_limit
           If there is no bdi_memcg (because all of the current memcg's
           dirty pages were first dirtied by other memcgs), then use the
           memcg lru to find an inode and call writeback_single_inode().
           This is to handle uncommon sharing.
       reference memcg for bdi flusher
       wake bdi flusher
   if over fg limit
       IO-full: write bdi_memcg directly (if empty, use memcg lru to
       find an inode to write)

       Clarification: IO-less: queue a memcg-waiting description to the
       bdi flusher waiters (balance_list).

Clarification:
wakeup_flusher_threads():
  would take an optional memcg parameter, which would be included in the
  created work item.

  try_to_free_pages() would pass in a memcg.  Other callers would pass
  in NULL.


bdi_flusher(bdi):
    Clarification: When processing the bdi work queue, some work items
    may include a memcg (see wakeup_flusher_threads above).  If present,
    use the specified memcg to determine which bdi_memcg (and thus
    b_dirty list) should be used.  If NULL, then all bdi_memcg would be
    considered to process all inodes within the bdi.

   once work queue is empty:
       wb_check_old_data_flush():
           write old inodes from each of the per-memcg dirty lists.

       wb_check_background_flush():
           if any of bdi_memcg->b_over_limit is set, then write
           bdi_memcg->b_dirty inodes until under limit.

               After writing some data, recheck to see if memcg is still over
               bg_thresh.  If under limit, then clear b_over_limit and release
               memcg reference.

               If unable to bring memcg dirty usage below bg limit after
               bdi_memcg->b_dirty is empty, release memcg reference and return.
               Next time memcg calls balance_dirty_pages it will either select
               another bdi or use lru to find an inode.

           use over_bground_thresh() to check global background limit.

When a memcg is deleted, it may leave behind bdi_memcg structures.  The
memcg pointers in these structures are no longer referenced.  As such
inodes are cleaned, the bdi_memcg b_dirty list will become empty and
the bdi_memcg will be deleted.


Too much code churn in writeback is not good.  So these memcg writeback
enhancements should probably wait for IO-less dirty throttling to get
worked out.  These memcg messages are design-level discussions to get me
heading in the right direction.  I plan on implementing memcg aware
writeback in the background while IO-less balance_dirty_pages is worked
out, so I can follow it up.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

