Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 18 Mar 2011 00:57:09 -0700
Greg Thelen <gthelen@xxxxxxxxxx> wrote:

> On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> >> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
> >> Vivek have had some discussion around how memcg and writeback mesh.
> >> In my mind, the
> >> discussions in 8/9 are starting to blend with this thread.
> >>
> >> I have been thinking about Johannes' struct memcg_mapping. ÂI think the idea
> >> may address several of the issues being discussed, especially
> >> interaction between
> >> IO-less balance_dirty_pages() and memcg writeback.
> >>
> >> Here is my thinking. ÂFeedback is most welcome!
> >>
> >> The data structures:
> >> - struct memcg_mapping {
> >> Â Â Â Âstruct address_space *mapping;
> >> Â Â Â Âstruct mem_cgroup *memcg;
> >> Â Â Â Âint refcnt;
> >> Â };
> >> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> >> - each memcg_bdi contains a mapping from inode to memcg_mapping. ÂThis may be a
> >> Â very large set representing many cached inodes.
> >> - each memcg_mapping represents all pages within an bdi,inode,memcg. ÂAll
> >> Â corresponding cached inode pages point to the same memcg_mapping via
> >> Â pc->mapping. ÂI assume that all pages of inode belong to no more than one bdi.
> >> - manage a global list of memcg that are over their respective background dirty
> >> Â limit.
> >> - i_mapping continues to point to a traditional non-memcg mapping (no change
> >> Â here).
> >> - none of these memcg_* structures affect root cgroup or kernels with memcg
> >> Â configured out.
> >
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> > Â Â Â Â...
> > Â Â Â Â/* key is struct backing_dev_info * */
> > Â Â Â Âstruct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> > Â Â Â Â/* key is struct address_space * */
> > Â Â Â Âstruct rb_root memcg_mappings;
> > Â Â Â Âstruct rb_node node;
> > };
> >
> > struct memcg_mapping {
> > Â Â Â Âstruct address_space *mapping;
> > Â Â Â Âstruct mem_cgroup *memcg;
> > Â Â Â Âstruct rb_node node;
> > Â Â Â Âatomic_t count;
> > };
> >
> > struct page_cgroup {
> > Â Â Â Â...
> > Â Â Â Âstruct memcg_mapping *memcg_mapping;
> > };
> >
> >> The routines under discussion:
> >> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> >> Â to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
> >> Â levels in data structure.
> >>
> >> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> >> Â memcg. ÂIf refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
> >> Â Also delete memcg_bdi if last memcg_mapping is removed.
> >>
> >> - account_page_dirtied(): nothing new here, continue to set the per-page flags
> >> Â and increment the memcg per-cpu dirty page counter. ÂSame goes for routines
> >> Â that mark pages in writeback and clean states.
> >
> > We may want to remember the dirty memcg_mappings so that on writeback
> > we don't have to go through every single one that the memcg refers to?
> 
> I think this is a good idea to allow per memcg per bdi list of dirty mappings.
> 
> It feels like some of this is starting to gel.  I've been sketching
> some of the code to see how the memcg locking will work out.  The
> basic structures I see are:
> 
> struct mem_cgroup {
>         ...
>         /*
>          * For all file pages cached by this memcg sort by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock;  /* or use rcu structure, memcg:bdi set
> could be fairly static */
> };
> 
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct
> memcg_mapping *
>          * memcg_mappings live within either mappings or
> dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock; /* protect [dirty_]mappings */
> };
> 
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
> 
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
> 
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
>   mappings: from address_space to memcg_mapping for clean pages
>   dirty_mappings: from address_space to memcg_mapping when there are
> some dirty pages
> - each memcg_mapping represents a set of cached pages within an
> bdi,inode,memcg.  All
>  corresponding cached inode pages point to the same memcg_mapping via
>  pc->mapping.  I assume that all pages of inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
>  limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
>  here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
>  configured out.
> 
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
>  to walk memcg -> memcg_bdi -> mappings and lazily allocating missing
>  levels in data structure.
> 
> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
>  memcg.  If refcnt drops to zero, then remove memcg_mapping from the
> memcg_bdi.[dirty_]mappings.
>  Also delete memcg_bdi if last memcg_mapping is removed.
> 
> - account_page_dirtied(): increment nr_dirty.  If first dirty page,
> then move memcg_mapping from memcg_bdi.mappings to
> memcg_bdi.dirty_mappings page counter.  When marking page clean, do
> the opposite.
> 
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage if above
>  background limit, then add memcg to global memcg_over_bg_limit list and use
>  memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher.  If over
>  fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
>  (aka memcg_bdi) accounting structure.
> 
> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
>  fs-writeback.c so that wb_do_writeback() will return to checking
>  wb_check_background_flush() to check background limits and being
> interruptible if sync flush occurs.  wb_check_background_flush() will
> check the global
>  memcg_over_bg_limit list for memcg that are over their dirty limit.
> Within each memcg write inodes from the dirty_mappings list until a
> threshold page count has been reached (MAX_WRITEBACK_PAGES).  Then
> move to next listed memcg.
> 
> - over_bground_thresh() will determine if memcg is still over bg limit.
>  If over limit, then it per bdi per memcg background flushing will continue.
>  If not over limit then memcg will be removed from memcg_over_bg_limit list.
> 
> I'll post my resulting patches in RFC form, or (at the least) my conclusions.
> 
please take care of force_empty and move_mapping at el. when you do this
and please do rmdir() tests.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]