Hello, On Tue 06-01-15 16:25:37, Tejun Heo wrote: > blkio cgroup (blkcg) is severely crippled in that it can only control > read and direct write IOs. blkcg can't tell which cgroup should be > held responsible for a given writeback IO and charges all of them to > the root cgroup - all normal write traffic ends up in the root cgroup. > Although the problem has been identified years ago, mainly because it > interacts with so many subsystems, it hasn't been solved yet. > > This patchset finally implements cgroup writeback support so that > writeback of a page is attributed to the corresponding blkcg of the > memcg that the page belongs to. > > Overall design > -------------- > > * This requires cooperation between memcg and blkcg. The IOs are > charged to the blkcg that the page's memcg corresponds to. This > currently works only on the unified hierarchy. > > * Each memcg maintains reference counted front and back pointers to > the correspending blkcg. Whenever a page gets dirtied or initiates > writeback, it uses the blkcg the front one points to. The reference > counting ensures that the association remains till the page is done > and having front and back pointers guarantees that the association > can change without being live-locked by pages being contiuously > dirtied. > > * struct bdi_writeback (wb) was always embedded in struct > backing_dev_info (bdi) and the distinction between the two wasn't > clear. This patchset makes wb operate as an independent writeback > > execution. bdi->wb is still embedded and serves the root cgroup but > other wb's can be associated with a single bdi each serving a > non-root wb. > > * All writeback operations are made per-wb instead of per-bdi. > bdi-wide operations are split across all member wb's. If some > finite amount needs to be distributed, be it number of pages to > writeback or bdi->min/max_ratio, it's distributed according to the > bandwidth proportion a wb has in the bdi. > > * Non-root wb's host and write back only dirty pages (I_DIRTY_PAGES). > I_DIRTY_[DATA]SYNC is always handled by the root wb. > > * An inode may have pages dirtied by different memcgs, which naturally > means that it should be able to be dirtied against multiple wb's. > To support linking an inode against multiple wb's, iwbl > (inode_wb_link) is introduced. An inode has multiple iwbl's > associated with it if it's dirty against multiple wb's. Is the ability for inode to belong to multiple memcgs really worth the effort? It brings significant complications (see also Dave's email) and the last time we were discussing cgroup writeback support the demand from users for this was small... How hard would it be to just start with an implementation which attaches the inode to the first memcg that dirties it (and detaches it when inode gets clean)? And implement sharing of inodes among mecgs only if there's a real demand for it? Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html