On Fri, 18 Mar 2011 00:57:09 -0700 Greg Thelen <gthelen@xxxxxxxxxx> wrote:

> On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> >> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback'
> >> Jan and Vivek have had some discussion around how memcg and writeback
> >> mesh.  In my mind, the discussions in 8/9 are starting to blend with
> >> this thread.
> >>
> >> I have been thinking about Johannes' struct memcg_mapping.  I think the
> >> idea may address several of the issues being discussed, especially the
> >> interaction between IO-less balance_dirty_pages() and memcg writeback.
> >>
> >> Here is my thinking.  Feedback is most welcome!
> >>
> >> The data structures:
> >> - struct memcg_mapping {
> >>         struct address_space *mapping;
> >>         struct mem_cgroup *memcg;
> >>         int refcnt;
> >>   };
> >> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to
> >>   memcg_bdi.
> >> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This
> >>   may be a very large set representing many cached inodes.
> >> - each memcg_mapping represents all pages within a (bdi, inode, memcg).
> >>   All corresponding cached inode pages point to the same memcg_mapping
> >>   via pc->mapping.  I assume that all pages of an inode belong to no
> >>   more than one bdi.
> >> - manage a global list of memcgs that are over their respective
> >>   background dirty limit.
> >> - i_mapping continues to point to a traditional non-memcg mapping (no
> >>   change here).
> >> - none of these memcg_* structures affect the root cgroup or kernels
> >>   with memcg configured out.
> >
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> >         ...
> >         /* key is struct backing_dev_info * */
> >         struct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> >         /* key is struct address_space * */
> >         struct rb_root memcg_mappings;
> >         struct rb_node node;
> > };
> >
> > struct memcg_mapping {
> >         struct address_space *mapping;
> >         struct mem_cgroup *memcg;
> >         struct rb_node node;
> >         atomic_t count;
> > };
> >
> > struct page_cgroup {
> >         ...
> >         struct memcg_mapping *memcg_mapping;
> > };
> >
> >> The routines under discussion:
> >> - memcg charging a new inode page to a memcg: will use inode->mapping
> >>   and inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily
> >>   allocating missing levels in the data structure (a sketch of this
> >>   walk follows below).
> >>
> >> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
> >>   locate the memcg.  If the refcnt drops to zero, then remove the
> >>   memcg_mapping from the memcg_bdi.  Also delete the memcg_bdi if the
> >>   last memcg_mapping is removed.
> >>
> >> - account_page_dirtied(): nothing new here, continue to set the
> >>   per-page flags and increment the memcg per-cpu dirty page counter.
> >>   Same goes for routines that mark pages in writeback and clean states.
> >
> > We may want to remember the dirty memcg_mappings so that on writeback
> > we don't have to go through every single one that the memcg refers to?
>
> I think this is a good idea to allow a per-memcg, per-bdi list of dirty
> mappings.
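To make the charge-time walk above concrete, the lookup-or-create step over
Johannes' memcg_mappings tree could look roughly like the sketch below.  This
is illustrative only, not code from any posted patch: the function name is
invented, the GFP flags are a guess, and whatever lock protects the tree is
assumed to be held by the caller.

#include <linux/rbtree.h>
#include <linux/slab.h>

static struct memcg_mapping *
memcg_mapping_find_or_create(struct memcg_bdi *memcg_bdi,
			     struct address_space *mapping,
			     struct mem_cgroup *memcg)
{
	struct rb_node **p = &memcg_bdi->memcg_mappings.rb_node;
	struct rb_node *parent = NULL;
	struct memcg_mapping *mm;

	while (*p) {
		parent = *p;
		mm = rb_entry(parent, struct memcg_mapping, node);
		if (mapping < mm->mapping)
			p = &parent->rb_left;
		else if (mapping > mm->mapping)
			p = &parent->rb_right;
		else {
			/* level already exists: just take a reference */
			atomic_inc(&mm->count);
			return mm;
		}
	}

	/* missing level: lazily allocate it, as described above */
	mm = kzalloc(sizeof(*mm), GFP_ATOMIC);
	if (!mm)
		return NULL;
	mm->mapping = mapping;
	mm->memcg = memcg;
	atomic_set(&mm->count, 1);
	rb_link_node(&mm->node, parent, p);
	rb_insert_color(&mm->node, &memcg_bdi->memcg_mappings);
	return mm;
}

The uncharge side would mirror this: atomic_dec_and_test() on count, and on
the final put rb_erase() the node and free it, removing the containing
memcg_bdi as well once its tree goes empty.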
> It feels like some of this is starting to gel.  I've been sketching
> some of the code to see how the memcg locking will work out.  The
> basic structures I see are:
>
> struct mem_cgroup {
>         ...
>         /*
>          * For all file pages cached by this memcg, sorted by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock; /* or use an rcu structure; the memcg:bdi
>                                  set could be fairly static */
> };
>
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct memcg_mapping *
>          * memcg_mappings live within either the mappings or
>          * dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock; /* protects [dirty_]mappings */
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
>     mappings: from address_space to memcg_mapping for clean pages
>     dirty_mappings: from address_space to memcg_mapping when there are
>       some dirty pages
> - each memcg_mapping represents a set of cached pages within a
>   (bdi, inode, memcg).  All corresponding cached inode pages point to
>   the same memcg_mapping via pc->mapping.  I assume that all pages of
>   an inode belong to no more than one bdi.
> - manage a global list of memcgs that are over their respective
>   background dirty limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no
>   change here).
> - none of these memcg_* structures affect the root cgroup or kernels
>   with memcg configured out.
>
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping
>   and inode to walk memcg -> memcg_bdi -> mappings, lazily allocating
>   missing levels in the data structure.
>
> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
>   locate the memcg.  If the refcnt drops to zero, then remove the
>   memcg_mapping from memcg_bdi.[dirty_]mappings.  Also delete the
>   memcg_bdi if the last memcg_mapping is removed.
>
> - account_page_dirtied(): increment nr_dirty.  If this is the first
>   dirty page, then move the memcg_mapping from memcg_bdi.mappings to
>   memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite
>   (a sketch of this transition follows after this mail).
>
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is
>   above the background limit, then add the memcg to the global
>   memcg_over_bg_limit list and use the memcg's set of memcg_bdi to wake
>   up each(?) corresponding bdi flusher (also sketched after this mail).
>   If over the foreground limit, then use IO-less-style foreground
>   throttling with the per-memcg, per-bdi (aka memcg_bdi) accounting
>   structure.
>
> - bdi writeback: will revert some of the mmotm memcg dirty limit
>   changes to fs-writeback.c so that wb_do_writeback() will return to
>   checking wb_check_background_flush() for background limits, and will
>   remain interruptible if a sync flush occurs.
>   wb_check_background_flush() will check the global memcg_over_bg_limit
>   list for memcgs that are over their dirty limit.  Within each memcg,
>   write inodes from the dirty_mappings list until a threshold page
>   count has been reached (MAX_WRITEBACK_PAGES), then move to the next
>   listed memcg.
>
> - over_bground_thresh() will determine if the memcg is still over the
>   bg limit.  If over the limit, then per-bdi, per-memcg background
>   flushing will continue.  If not over the limit, then the memcg will
>   be removed from the memcg_over_bg_limit list.
>
> I'll post my resulting patches in RFC form, or (at the least) my
> conclusions.
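A rough sketch of the mappings <-> dirty_mappings transition described in the
account_page_dirtied() item above, using Greg's structures.  The helper names
are invented, and races between concurrent dirty and clean transitions are
glossed over; a real implementation would recheck nr_dirty under
memcg_bdi->lock before moving the node.

#include <linux/rbtree.h>
#include <linux/spinlock.h>

/* assumed helper: keyed insert of @mm into @root, same walk as a lookup */
static void memcg_mapping_tree_insert(struct rb_root *root,
				      struct memcg_mapping *mm)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct memcg_mapping *entry;

		parent = *p;
		entry = rb_entry(parent, struct memcg_mapping, node);
		if (mm->mapping < entry->mapping)
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&mm->node, parent, p);
	rb_insert_color(&mm->node, root);
}

/*
 * memcg side of account_page_dirtied(): the 0 -> 1 transition moves the
 * mapping to the dirty tree.  The clean side would mirror this with
 * atomic_dec_and_test() and the two trees swapped.
 */
static void memcg_mapping_inc_dirty(struct memcg_mapping *mm)
{
	struct memcg_bdi *mb = mm->memcg_bdi;

	if (atomic_inc_return(&mm->nr_dirty) == 1) {
		spin_lock(&mb->lock);
		rb_erase(&mm->node, &mb->mappings);
		memcg_mapping_tree_insert(&mb->dirty_mappings, mm);
		spin_unlock(&mb->lock);
	}
}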
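And a similarly hedged sketch of the background half of
mem_cgroup_balance_dirty_pages().  memcg_dirty_over_bg_limit() and the
bg_limit_node list hook are invented names (the node is assumed to be set up
with INIT_LIST_HEAD() and removed with list_del_init(), so list_empty()
tells whether the memcg is already queued); bdi_start_background_writeback()
is the existing flusher wakeup in fs/fs-writeback.c.

#include <linux/backing-dev.h>
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

static LIST_HEAD(memcg_over_bg_limit);		/* global over-limit list */
static DEFINE_SPINLOCK(memcg_over_bg_limit_lock);

void mem_cgroup_balance_dirty_pages(struct mem_cgroup *memcg)
{
	struct rb_node *node;

	if (!memcg_dirty_over_bg_limit(memcg))	/* assumed limit check */
		return;

	/* queue the memcg for wb_check_background_flush() to find */
	spin_lock(&memcg_over_bg_limit_lock);
	if (list_empty(&memcg->bg_limit_node))	/* assumed hook in mem_cgroup */
		list_add_tail(&memcg->bg_limit_node, &memcg_over_bg_limit);
	spin_unlock(&memcg_over_bg_limit_lock);

	/* wake the flusher of every bdi this memcg has pages on */
	spin_lock(&memcg->bdis_lock);
	for (node = rb_first(&memcg->bdis); node; node = rb_next(node)) {
		struct memcg_bdi *mb = rb_entry(node, struct memcg_bdi, node);

		bdi_start_background_writeback(mb->bdi);
	}
	spin_unlock(&memcg->bdis_lock);
}

over_bground_thresh() would then use list_del_init() to take the memcg back
off memcg_over_bg_limit once writeback brings it under its background limit.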
please take care of force_empty and move_mapping et al. when you do this,
and please do rmdir() tests.

Thanks,
-Kame