Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Wed, Apr 6, 2011 at 8:39 AM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
>
> [..]
>> > Can someone describe a valid shared inode use case? If not, we
>> > should not even consider it as a requirement and explicitly document
>> > it as a "not supported" use case.
>>
>> At the very least, when a task is moved from one cgroup to another,
>> we've got a shared inode case.  This probably won't happen more than
>> once for most tasks, but it will likely be common.
>
> I am hoping that for such cases sooner or later inode movement will
> automatically take place. At some point of time, inode will be clean
> and no more on memcg_bdi list. And when it is dirtied again, I am
> hoping it will be queued on new groups's list and not on old group's
> list? Greg?
>
> Thanks
> Vivek

When an inode is marked dirty, current->memcg is used to determine
which per memcg b_dirty list within the bdi is used to queue the
inode.  When the inode is marked clean, then the inode is removed from
the per memcg b_dirty list.  So, as Vivek said, when a process is
migrated between memcg, then the previously dirtied inodes will not be
moved.  Once such inodes are marked clean and then re-dirtied,
they will be requeued to the correct per memcg dirty inode list.
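
The requeue-on-redirty behavior described above can be sketched in a few
lines of userspace C.  This is only an illustrative model, not kernel code:
the struct layouts, the integer memcg ids, and the clear_inode_dirty() helper
are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: memcg ids are plain ints; 0 means "not queued". */
struct task  { int memcg; };
struct inode { bool dirty; int queued_memcg; };

/*
 * Queue the inode on the dirtier's memcg list only on the clean -> dirty
 * transition.  An already-dirty inode stays on the list it was queued on.
 */
static void mark_inode_dirty(struct inode *inode, const struct task *current)
{
    if (inode->dirty)
        return;
    inode->dirty = true;
    inode->queued_memcg = current->memcg;
}

/* Cleaning the inode removes it from whatever per memcg list it was on. */
static void clear_inode_dirty(struct inode *inode)
{
    inode->dirty = false;
    inode->queued_memcg = 0;
}
```

In this model, a task that dirties an inode in memcg 1 and then migrates to
memcg 2 leaves the inode queued under memcg 1 until it has been cleaned and
dirtied again.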

Here's an overview of the approach, which assumes inode sharing is
rare but possible.  Thus, such sharing is tolerated (no live locks,
etc) but not optimized.

bdi -> 1:N -> bdi_memcg -> 1:N -> inode

mark_inode_dirty(inode)
    If I_DIRTY is clear, set I_DIRTY and insert inode into bdi_memcg->b_dirty
    using current->memcg as a key to select the correct list.
        This will require memory allocation of bdi_memcg, if this is the first
        inode within the bdi,memcg.  If the allocation fails (rare, but possible),
        then fall back to adding the inode to the root cgroup dirty inode list.
    If I_DIRTY is already set, then do nothing.

When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
if the list is now empty.
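
To make the bdi_memcg lifecycle concrete, here is a rough userspace sketch of
the lookup, the allocate-on-first-inode step, the root fallback, and the
delete-when-empty step.  Everything beyond the names bdi_memcg, b_dirty and
mark_inode_dirty is an assumption for illustration: the list is reduced to a
counter (nr_dirty), the root entry is embedded in the bdi so the fallback
never allocates, and fail_next_alloc is just a hook to simulate an allocation
failure.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ROOT_MEMCG 0

struct bdi_memcg {
    int memcg;                 /* key: id of the owning memcg */
    int nr_dirty;              /* stand-in for the length of b_dirty */
    struct bdi_memcg *next;
};

struct bdi {
    struct bdi_memcg root;     /* embedded, so the fallback never allocates */
    struct bdi_memcg *others;  /* dynamically allocated per memcg entries */
};

struct inode { bool dirty; struct bdi_memcg *queued_on; };

static bool fail_next_alloc;   /* hook to simulate an allocation failure */

/* Find the bdi_memcg for this memcg, allocating it on first use. */
static struct bdi_memcg *bdi_memcg_get(struct bdi *bdi, int memcg)
{
    struct bdi_memcg *bm;

    if (memcg == ROOT_MEMCG)
        return &bdi->root;
    for (bm = bdi->others; bm; bm = bm->next)
        if (bm->memcg == memcg)
            return bm;

    bm = fail_next_alloc ? NULL : calloc(1, sizeof(*bm));
    fail_next_alloc = false;
    if (!bm)
        return &bdi->root;     /* rare fallback: queue on the root list */
    bm->memcg = memcg;
    bm->next = bdi->others;
    bdi->others = bm;
    return bm;
}

static void mark_inode_dirty(struct bdi *bdi, struct inode *inode, int memcg)
{
    if (inode->dirty)
        return;                /* I_DIRTY already set: do nothing */
    inode->dirty = true;
    inode->queued_on = bdi_memcg_get(bdi, memcg);
    inode->queued_on->nr_dirty++;
}

static void clear_inode_dirty(struct bdi *bdi, struct inode *inode)
{
    struct bdi_memcg *bm = inode->queued_on, **pp;

    if (!inode->dirty)
        return;
    inode->dirty = false;
    inode->queued_on = NULL;
    if (--bm->nr_dirty > 0 || bm == &bdi->root)
        return;
    /* The list is now empty: unlink and delete the bdi_memcg. */
    for (pp = &bdi->others; *pp; pp = &(*pp)->next) {
        if (*pp == bm) {
            *pp = bm->next;
            free(bm);
            return;
        }
    }
}
```

A real implementation would also need locking and would key on the memcg
itself rather than an integer id; both are left out here.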

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
    if over bg limit, then
        set bdi_memcg->b_over_limit
            If there is no bdi_memcg (because all inodes of current's memcg
            dirty pages were first dirtied by another memcg) then use the
            memcg lru to find an inode and call writeback_single_inode().
            This is to handle uncommon sharing.
        reference memcg for bdi flusher
        awake bdi flusher
    if over fg limit
        IO-full: write bdi_memcg directly (if empty use memcg lru to find
        inode to write)
        IO-less: queue memcg-waiting description to bdi flusher.
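
The decision tree above can be written as a small pure function, which may
make the bg/fg cases easier to compare.  This is a sketch of my reading of
the outline, not proposed kernel code: the ACT_* flags, the function name and
its parameters are all made up for the example, and have_bdi_memcg is assumed
to mean "a bdi_memcg exists for this memcg and has dirty inodes queued".

```c
#include <stdbool.h>

/* Hypothetical action flags; none of these names exist in the kernel. */
enum {
    ACT_SET_OVER_LIMIT  = 1 << 0, /* set bdi_memcg->b_over_limit */
    ACT_LRU_WRITEBACK   = 1 << 1, /* memcg lru + writeback_single_inode() */
    ACT_WAKE_FLUSHER    = 1 << 2, /* take memcg reference, wake bdi flusher */
    ACT_WRITE_BDI_MEMCG = 1 << 3, /* IO-full: write bdi_memcg directly */
    ACT_QUEUE_WAITER    = 1 << 4, /* IO-less: queue waiter to bdi flusher */
};

static unsigned int memcg_balance_actions(unsigned long nr_dirty,
                                          unsigned long bg_thresh,
                                          unsigned long fg_thresh,
                                          bool have_bdi_memcg,
                                          bool io_less)
{
    unsigned int acts = 0;

    if (nr_dirty > bg_thresh) {
        /* Without a bdi_memcg, fall back to the lru (uncommon sharing). */
        acts |= have_bdi_memcg ? ACT_SET_OVER_LIMIT : ACT_LRU_WRITEBACK;
        acts |= ACT_WAKE_FLUSHER;
    }
    if (nr_dirty > fg_thresh) {
        if (io_less)
            acts |= ACT_QUEUE_WAITER;
        else
            acts |= have_bdi_memcg ? ACT_WRITE_BDI_MEMCG
                                   : ACT_LRU_WRITEBACK;
    }
    return acts;
}
```

For example, with bg_thresh = 10 and fg_thresh = 20, a memcg with 15 dirty
pages and a populated bdi_memcg only gets ACT_SET_OVER_LIMIT and
ACT_WAKE_FLUSHER; at 25 dirty pages in IO-less mode it additionally gets
ACT_QUEUE_WAITER.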

bdi_flusher(bdi):
    process work queue, which will not include any memcg flusher work - just
    like current code.

    once work queue is empty:
        wb_check_old_data_flush():
            write old inodes from each of the per-memcg dirty lists.

        wb_check_background_flush():
            if any of bdi_memcg->b_over_limit is set, then write
            bdi_memcg->b_dirty inodes until under limit.

                After writing some data, recheck to see if memcg is still over
                bg_thresh.  If under limit, then clear b_over_limit and release
                memcg reference.

                If unable to bring memcg dirty usage below bg limit after
                bdi_memcg->b_dirty is empty, release memcg reference and return.
                Next time memcg calls balance_dirty_pages it will either select
                another bdi or use lru to find an inode.

            use over_bground_thresh() to check global background limit.
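
The per memcg part of wb_check_background_flush() could look roughly like the
following userspace model.  The struct fields other than b_over_limit and
b_dirty, the flush_bdi_memcg() name, and the representation of b_dirty as an
array of per-inode dirty page counts are assumptions for the sketch.  It also
assumes b_over_limit stays set when the list runs out while the memcg is
still over its limit, which the outline does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>

struct memcg {
    unsigned long nr_dirty;   /* dirty pages charged to this memcg */
    unsigned long bg_thresh;  /* background limit */
    int refs;                 /* reference held on behalf of the flusher */
};

struct bdi_memcg {
    struct memcg *memcg;
    bool b_over_limit;
    unsigned long *b_dirty;   /* model: dirty page count per queued inode */
    size_t nr_inodes;
};

/* Returns the number of inodes written for this bdi_memcg. */
static size_t flush_bdi_memcg(struct bdi_memcg *bm)
{
    struct memcg *memcg = bm->memcg;
    size_t written = 0;

    if (!bm->b_over_limit)
        return 0;

    /* Write inodes, rechecking the limit after each one. */
    while (bm->nr_inodes > 0 && memcg->nr_dirty > memcg->bg_thresh) {
        unsigned long pages = bm->b_dirty[--bm->nr_inodes];

        memcg->nr_dirty -= pages < memcg->nr_dirty ? pages : memcg->nr_dirty;
        written++;
    }

    /* Under the limit: clear the flag.  Either way, drop the reference. */
    if (memcg->nr_dirty <= memcg->bg_thresh)
        bm->b_over_limit = false;
    memcg->refs--;
    return written;
}
```

If the list empties while the memcg is still over bg_thresh, the function
just drops the reference and returns, leaving it to the next
balance_dirty_pages() call to pick another bdi or fall back to the lru.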

When a memcg is deleted it may leave behind bdi_memcg structures.  These memcg
pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
list will become empty and bdi_memcg will be deleted.