Hello, This is v2 of cgroup writeback support patchset. Changes from the last take[L] are * Dropped dirtying an inode against multiple cgroups at the same time. At a given time, an inode always belongs to one cgroup bdi_writeback. It's currently associated on the first-use but a later patchset will implement wb switching on foreign inodes. This removed the colored blkcg_css association pointer in memcg and inode_wb_link. * With the above change, cgroup writeback is now responsible for perfomring impedance matching when the association between memcg and blkcg changes. This is achieved by making bdi_writeback per-memcg rather than per-blkcg. On the unified hierarchy, blkcg implicitly enables memcg, so this gives a unique bdi_writeback for each memcg - blkcg combination. This causes some complexities when the mapping changes and with congesttion state handling as there can be multiple wb's mapped to one blkcg. * Refreshed on top of the current block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set"). This includes refreshing on top of the new ->b_dirty_time changes. blkio cgroup (blkcg) is severely crippled in that it can only control read and direct write IOs. blkcg can't tell which cgroup should be held responsible for a given writeback IO and charges all of them to the root cgroup - all normal write traffic ends up in the root cgroup. Although the problem has been identified years ago, mainly because it interacts with so many subsystems, it hasn't been solved yet. This patchset finally implements cgroup writeback support so that writeback of a page is attributed to the corresponding blkcg of the memcg that the page belongs to. Overall design -------------- * This requires cooperation between memcg and blkcg. Each inode is assigned to the blkcg mapped to the memcg being dirtied. * struct bdi_writeback (wb) was always embedded in struct backing_dev_info (bdi) and the distinction between the two wasn't clear. This patchset makes wb operate as an independent writeback execution domain. bdi->wb is still embedded and serves the root cgroup but there can be other wb's for other cgroups. * Each wb is associated with memcg. As memcg is implicitly enabled by blkcg on the unified hierarchy, this gives a unique wb for each memcg-blkcg combination. When memcg-blkcg mapping changes, a new wb is created and the existing wb is unlinked and drained. * An inode is associated with the matching wb when it gets dirtied for the first time and written back by that wb. A later patchset will implement dynamic wb switching. * All writeback operations are made per-wb instead of per-bdi. bdi-wide operations are split across all member wb's. If some finite amount needs to be distributed, be it number of pages to writeback or bdi->min/max_ratio, it's distributed according to the bandwidth proportion a wb has in the bdi. * cgroup writeback support adds one pointer to struct inode. Missing pieces -------------- * It requires some cooperation from the filesystem and currently only works with ext2. The changes necessary on the filesystem side are almost trivial. I'll write up a documentation on it. * blk-throttle works but cfq-iosched isn't ready for writebacks coming down with different cgroups. cfq-iosched should be updated to have a writeback ioc per cgroup and route writeback IOs through it. How to test ----------- * Boot with kernel option "cgroup__DEVEL__legacy_files_on_dfl". * umount /sys/fs/cgroup/memory umount /sys/fs/cgroup/blkio mkdir /sys/fs/cgroup/unified mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup/unified echo +blkio > /sys/fs/cgroup/unified/cgroup.subtree_control * Build the cgroup hierarchy (don't forget to enable blkio using subtree_control) and put processes in cgroups and run tests on ext2 filesystems and blkio.throttle.* knobs. This patchset contains the following 48 patches. 0001-memcg-add-per-cgroup-dirty-page-accounting.patch 0002-blkcg-move-block-blk-cgroup.h-to-include-linux-blk-c.patch 0003-update-CONFIG_BLK_CGROUP-dummies-in-include-linux-bl.patch 0004-memcg-add-mem_cgroup_root_css.patch 0005-blkcg-add-blkcg_root_css.patch 0006-cgroup-block-implement-task_get_css-and-use-it-in-bi.patch 0007-blkcg-implement-task_get_blkcg_css.patch 0008-blkcg-implement-bio_associate_blkcg.patch 0009-memcg-implement-mem_cgroup_css_from_page.patch 0010-writeback-move-backing_dev_info-state-into-bdi_write.patch 0011-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch 0012-writeback-move-bandwidth-related-fields-from-backing.patch 0013-writeback-s-bdi-wb-in-mm-page-writeback.c.patch 0014-writeback-move-backing_dev_info-wb_lock-and-worklist.patch 0015-writeback-reorganize-mm-backing-dev.c.patch 0016-writeback-separate-out-include-linux-backing-dev-def.patch 0017-bdi-make-inode_to_bdi-inline.patch 0018-writeback-add-gfp-to-wb_init.patch 0019-bdi-separate-out-congested-state-into-a-separate-str.patch 0020-writeback-add-CONFIG-BDI_CAP-FS-_CGROUP_WRITEBACK.patch 0021-writeback-make-backing_dev_info-host-cgroup-specific.patch 0022-writeback-blkcg-associate-each-blkcg_gq-with-the-cor.patch 0023-writeback-attribute-stats-to-the-matching-per-cgroup.patch 0024-writeback-let-balance_dirty_pages-work-on-the-matchi.patch 0025-writeback-make-congestion-functions-per-bdi_writebac.patch 0026-writeback-blkcg-restructure-blk_-set-clear-_queue_co.patch 0027-writeback-blkcg-propagate-non-root-blkcg-congestion-.patch 0028-writeback-implement-and-use-mapping_congested.patch 0029-writeback-implement-WB_has_dirty_io-wb_state-flag.patch 0030-writeback-implement-backing_dev_info-tot_write_bandw.patch 0031-writeback-make-bdi_has_dirty_io-take-multiple-bdi_wr.patch 0032-writeback-don-t-issue-wb_writeback_work-if-clean.patch 0033-writeback-make-bdi-min-max_ratio-handling-cgroup-wri.patch 0034-writeback-implement-bdi_for_each_wb.patch 0035-writeback-remove-bdi_start_writeback.patch 0036-writeback-make-laptop_mode_timer_fn-handle-multiple-.patch 0037-writeback-make-writeback_in_progress-take-bdi_writeb.patch 0038-writeback-make-bdi_start_background_writeback-take-b.patch 0039-writeback-make-wakeup_flusher_threads-handle-multipl.patch 0040-writeback-add-wb_writeback_work-auto_free.patch 0041-writeback-implement-bdi_wait_for_completion.patch 0042-writeback-implement-wb_wait_for_single_work.patch 0043-writeback-restructure-try_writeback_inodes_sb-_nr.patch 0044-writeback-make-writeback-initiation-functions-handle.patch 0045-writeback-dirty-inodes-against-their-matching-cgroup.patch 0046-buffer-writeback-make-__block_write_full_page-honor-.patch 0047-mpage-make-__mpage_writepage-honor-cgroup-writeback.patch 0048-ext2-enable-cgroup-writeback-support.patch 0001-0019 are preps. 0020-0045 gradually convert writeback code so that wb (bdi_writeback) operates as an independent writeback domain instead of bdi (backing_dev_info), a single bdi can have multiple per-cgroup wb's working for it, and per-bdi operations are translated and distributed to all its member wb's. 0046-0048 make lower layers to properly propagate the cgroup association from the writeback layer and enable cgroup writeback on ext2. This patchset is on top of block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set") + [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation and available in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-20150322 diffstat follows. Thanks. Documentation/cgroups/memory.txt | 1 block/bio.c | 35 +- block/blk-cgroup.c | 28 + block/blk-cgroup.h | 603 ----------------------------------- block/blk-core.c | 70 ++-- block/blk-integrity.c | 1 block/blk-sysfs.c | 3 block/blk-throttle.c | 2 block/bounce.c | 1 block/cfq-iosched.c | 2 block/elevator.c | 2 block/genhd.c | 1 drivers/block/drbd/drbd_int.h | 1 drivers/block/drbd/drbd_main.c | 10 drivers/block/pktcdvd.c | 1 drivers/char/raw.c | 1 drivers/md/bcache/request.c | 1 drivers/md/dm.c | 2 drivers/md/dm.h | 1 drivers/md/md.h | 1 drivers/md/raid1.c | 4 drivers/md/raid10.c | 2 drivers/mtd/devices/block2mtd.c | 1 fs/block_dev.c | 9 fs/buffer.c | 60 ++- fs/ext2/super.c | 2 fs/ext4/extents.c | 1 fs/ext4/mballoc.c | 1 fs/ext4/super.c | 1 fs/f2fs/node.c | 4 fs/f2fs/segment.h | 3 fs/fs-writeback.c | 609 +++++++++++++++++++++++++---------- fs/fuse/file.c | 12 fs/gfs2/super.c | 2 fs/hfs/super.c | 1 fs/hfsplus/super.c | 1 fs/inode.c | 1 fs/mpage.c | 2 fs/nfs/filelayout/filelayout.c | 1 fs/nfs/internal.h | 2 fs/nfs/write.c | 3 fs/ocfs2/file.c | 1 fs/reiserfs/super.c | 1 fs/ufs/super.c | 1 fs/xfs/xfs_aops.c | 12 fs/xfs/xfs_file.c | 1 include/linux/backing-dev-defs.h | 188 ++++++++++ include/linux/backing-dev.h | 572 ++++++++++++++++++++++++--------- include/linux/bio.h | 3 include/linux/blk-cgroup.h | 631 ++++++++++++++++++++++++++++++++++++ include/linux/blkdev.h | 21 - include/linux/cgroup.h | 25 + include/linux/fs.h | 13 include/linux/memcontrol.h | 10 include/linux/mm.h | 3 include/linux/pagemap.h | 3 include/linux/writeback.h | 25 - include/trace/events/writeback.h | 8 init/Kconfig | 5 mm/backing-dev.c | 667 +++++++++++++++++++++++++++++++-------- mm/fadvise.c | 2 mm/filemap.c | 32 + mm/madvise.c | 1 mm/memcontrol.c | 59 +++ mm/page-writeback.c | 649 +++++++++++++++++++++---------------- mm/readahead.c | 2 mm/rmap.c | 2 mm/truncate.c | 25 + mm/vmscan.c | 29 + 69 files changed, 2979 insertions(+), 1501 deletions(-) -- tejun [L] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@xxxxxxxxxx [1] http://lkml.kernel.org/g/20150323041848.GA8991@xxxxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html