On Fri 22-05-15 17:13:37, Tejun Heo wrote:
> For the planned cgroup writeback support, on each bdi
> (backing_dev_info), each memcg will be served by a separate wb
> (bdi_writeback). This patch updates bdi so that a bdi can host
> multiple wbs (bdi_writebacks).
>
> On the default hierarchy, blkcg implicitly enables memcg. This allows
> using memcg's page ownership for attributing writeback IOs, and every
> memcg - blkcg combination can be served by its own wb by assigning a
> dedicated wb to each memcg. This means that there may be multiple
> wb's of a bdi mapped to the same blkcg. As congested state is per
> blkcg - bdi combination, those wb's should share the same congested
> state. This is achieved by tracking congested state via
> bdi_writeback_congested structs which are keyed by blkcg.
>
> bdi->wb remains unchanged and will keep serving the root cgroup.
> cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
> looked up while dirtying an inode according to the memcg of the page
> being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
> by its memcg id. Once an inode is associated with its wb, it can be
> retrieved using inode_to_wb().
>
> Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
> pages will keep being associated with bdi->wb.
>
> v3: inode_attach_wb() in account_page_dirtied() moved inside
>     mapping_cap_account_dirty() block where it's known to be !NULL.
>     Also, an unnecessary NULL check before kfree() removed. Both
>     detected by the kbuild bot.
>
> v2: Updated so that wb association is per inode and wb is per memcg
>     rather than blkcg.

This changelog may be a good place to explain (and to add that
explanation as a comment before the definition of struct bdi_writeback)
why the writeback structures are per memcg and not per the coarser
blkcg. I was pondering it for a while before I realized that the amount
of available memory, and thus the dirty limits, are a memcg property, so
we have to be able to write back only a specific memcg. It would be
nice if one didn't have to figure this out on his own (although it's
kind of obvious once you realize it ;).

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <jack@xxxxxxxx>

A few nits below.

> +/**
> + * wb_find_current - find wb for %current on a bdi
> + * @bdi: bdi of interest
> + *
> + * Find the wb of @bdi which matches both the memcg and blkcg of %current.
> + * Must be called under rcu_read_lock() which protects the returend wb.
                                                              ^^ returned
> + * NULL if not found.
> + */
> +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
> +{
> +	struct cgroup_subsys_state *memcg_css;
> +	struct bdi_writeback *wb;
> +
> +	memcg_css = task_css(current, memory_cgrp_id);
> +	if (!memcg_css->parent)
> +		return &bdi->wb;
> +
> +	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
> +
> +	/*
> +	 * %current's blkcg equals the effective blkcg of its memcg. No
> +	 * need to use the relatively expensive cgroup_get_e_css().
> +	 */
> +	if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id)))
> +		return wb;

Am I right that this comparison can fail only when the memcg has been
moved to a different blkcg? Just want to make sure I understand things
correctly...

...

> +/**
> + * wb_congested_put - put a wb_congested
> + * @congested: wb_congested to put
> + *
> + * Put @congested and destroy it if the refcnt reaches zero.
> + */
> +void wb_congested_put(struct bdi_writeback_congested *congested)
> +{
> +	struct backing_dev_info *bdi = congested->bdi;
> +	unsigned long flags;
> +
> +	if (congested->blkcg_id == 1)
> +		return;
> +
> +	local_irq_save(flags);
> +	if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) {
> +		local_irq_restore(flags);
> +		return;
> +	}
> +
> +	rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree);
> +	spin_unlock_irqrestore(&cgwb_lock, flags);
> +	kfree(congested);
> +
> +	if (atomic_dec_and_test(&bdi->usage_cnt))
> +		wake_up_all(&cgwb_release_wait);

Maybe we could have a small wrapper for dropping bdi->usage_cnt? If
someone forgets to wake up cgwb_release_wait after dropping the ref
count, it will be somewhat difficult to chase down that call site...
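Something like the following is all I have in mind (just a sketch; the
helper name bdi_put_usage_cnt is made up, not taken from the patch):

/*
 * Example helper (name is only illustrative): drop a reference on
 * bdi->usage_cnt and wake up waiters once it reaches zero.
 */
static void bdi_put_usage_cnt(struct backing_dev_info *bdi)
{
	if (atomic_dec_and_test(&bdi->usage_cnt))
		wake_up_all(&cgwb_release_wait);
}

Then every place that drops the ref goes through a single function and
the wakeup cannot be missed.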
...

> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
> +{
> +	return &memcg->cgwb_list;
> +}
> +
> +#endif /* CONFIG_CGROUP_WRITEBACK */
> +

What is the reason for this wrapper? It doesn't seem particularly
useful...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR