On Mon, Feb 5, 2024 at 5:06 PM zhaoyang.huang <zhaoyang.huang@xxxxxxxxxx> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>
> Currently, a request's ioprio is set via the task's scheduling priority
> (when no blkcg is configured), so high-priority tasks are privileged in
> both CPU and IO scheduling. Furthermore, most write requests are launched
> asynchronously from a kworker, which cannot know the submitter's priority.
> This commit works as a hint on top of the original policy by promoting the
> request's ioprio based on the page/folio's activity. The original idea
> comes from LRU_GEN, which provides more precise folio activity information
> than before. This commit adjusts the request's ioprio when a certain share
> of its folios is hot, which indicates that the request carries important
> content and needs to be scheduled earlier.
>
> This commit provides two mutually exclusive sets of APIs.
>
> * Counting activity by iterating the bio's pages:
> The filesystem should call bio_set_active_ioprio() before submit_bio at
> the spots it wants (buffered read/write/sync etc.).
>
> * Counting activity during each call:
> The filesystem should call bio_set_active_ioprio_page/folio() after
> calling bio_add_page/folio. Please note that this set of APIs cannot
> handle bvec_try_merge_page cases.
>
> This commit is verified on a v6.6 Android 14 system with 6GB of RAM via 4
> test cases, by calling bio_set_active_ioprio in erofs, ext4, f2fs and
> blkdev (raw partition of the gendisk).
>
> Case 1:
> Script [a] gets significantly improved fault time as expected [b]*,
> where dd's cost also shrinks from 55s to 40s.
> (1). fault_latency.bin is an ebpf-based test tool which measures all
> tasks' iowait latency during page faults when scheduled out/in.
> (2). costmem generates page faults by mmapping a file and accessing the VA.
> (3). dd generates concurrent vfs IO.
>
> [a]
> ./fault_latency.bin 1 5 > /data/dd_costmem &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> [b]
>            mainline    commit
> io wait    736us       523us
>
> * provides the correct result for test case 1, which in v7 was wrongly
> compared between EMMC and UFS.
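
As a concrete illustration of the first API set, here is a minimal sketch of
how a filesystem's submission path might wire it in. The helper name
example_submit_read_bio is made up for illustration and is not part of the
patch; it assumes the bio has already been populated with its folios:

#include <linux/bio.h>

/*
 * Illustrative sketch only: promote bi_ioprio based on how many of the
 * attached folios are workingset-active, then hand the bio to the block
 * layer. RT submitters keep their class unchanged.
 */
static void example_submit_read_bio(struct bio *bio)
{
	bio_set_active_ioprio(bio);
	submit_bio(bio);
}
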
>
> Case 2:
> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> mainline: 513MiB/s
> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
>
> commit: 633MiB/s
> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
>
> Case 3:
> This commit is also verified by launching the camera APP, which is usually
> considered a heavy workload on both memory and IO, and shows a 12%-24%
> improvement.
>
>            ttl = 0    ttl = 50    ttl = 100
> mainline   2267ms     2420ms      2316ms
> commit     1992ms     1806ms      1998ms
>
> Case 4:
> androbench shows neither improvement nor regression in the RD/WR test
> items, while making a 3% improvement in the sqlite items.
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> ---
> change of v2: calculate the page's activity via a helper function
> change of v3: solve the layer violation by moving the API into mm
> change of v4: keep block clean by removing the page-related API
> change of v5: introduce the macros of bio_add_folio/page for read dir.
> change of v6: replace the bio_add_xxx macros with submit_bio, which
> iterates the bio_vec before launching the bio to the block layer
> change of v7: introduce the function bio_set_active_ioprio;
> provide updated test results
> change of v8: provide two sets of APIs for bio_set_active_ioprio_xxx
> ---
> ---
>  block/Kconfig             | 27 +++++++++++
>  block/bio.c               | 94 +++++++++++++++++++++++++++++++++++++++
>  include/linux/bio.h       |  3 ++
>  include/linux/blk_types.h |  7 ++-
>  4 files changed, 130 insertions(+), 1 deletion(-)
>
> diff --git a/block/Kconfig b/block/Kconfig
> index f1364d1c0d93..5e721678ea3d 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -228,6 +228,33 @@ config BLOCK_HOLDER_DEPRECATED
>  config BLK_MQ_STACKING
>  	bool
>
> +config BLK_CONT_ACT_BASED_IOPRIO
> +	bool "Enable content activity based ioprio"
> +	depends on LRU_GEN
> +	default n
> +	help
> +	  This option enables adjusting a bio's priority by
> +	  calculating its content's activity.
> +	  It works as a hint on top of the original bio_set_ioprio,
> +	  which means an RT task gets no change to its bio->bi_ioprio,
> +	  while other tasks have the opportunity to raise the ioprio
> +	  if the bio carries a certain number of active pages.
> +	  The filesystem should use this by modifying its buffered
> +	  read/write/sync functions to raise bio->bi_ioprio before
> +	  calling submit_bio or after bio_add_page/folio.
> +
> +config BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +	bool "Counting bio's activity by iterating bio's pages"
> +	depends on BLK_CONT_ACT_BASED_IOPRIO
> +	help
> +	  The API under this config counts the bio's activity by iterating the bio.
> +
> +config BLK_CONT_ACT_BASED_IOPRIO_ADD_PAGE
> +	bool "Counting bio's activity when adding page or folio"
> +	depends on BLK_CONT_ACT_BASED_IOPRIO && !BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +	help
> +	  The APIs under this config count activity during each call but cannot
> +	  handle bvec_try_merge_page cases; please be sure you are OK with that.

Counting activity during each call cannot handle the bvec_try_merge_page
cases, because bio_add_page() still returns a valid length when the page is
merged into the previous bvec rather than added as a new one. So I provide
two mutually exclusive sets of APIs and keep the iterating one. For
reference, bio_add_page() looks like this:

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int offset)
{
	bool same_page = false;

	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
		return 0;
	if (bio->bi_iter.bi_size > UINT_MAX - len)
		return 0;

	if (bio->bi_vcnt > 0 &&
	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
				page, len, offset, &same_page)) {
		bio->bi_iter.bi_size += len;
		return len;
	}

	if (bio->bi_vcnt >= bio->bi_max_vecs)
		return 0;
	__bio_add_page(bio, page, len, offset);
	return len;
}

>  source "block/Kconfig.iosched"
>
>  endif # BLOCK
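
For completeness, here is a minimal sketch of how a caller might use the
per-add variant when such merging cannot occur. The helper name
example_add_folio is made up for illustration and is not part of the patch:

#include <linux/bio.h>

/*
 * Illustrative sketch only: account each folio's workingset state right
 * after it is attached, so the bio's activity counter stays in step with
 * bi_vcnt. Only safe when bvec_try_merge_page() cannot trigger.
 */
static bool example_add_folio(struct bio *bio, struct folio *folio,
			      size_t len, size_t off)
{
	if (!bio_add_folio(bio, folio, len, off))
		return false;
	bio_set_active_ioprio_folio(bio, folio);
	return true;
}
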
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..73916a6c319f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1476,6 +1476,100 @@ void bio_set_pages_dirty(struct bio *bio)
>  }
>  EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
>
> +/*
> + * bio_set_active_ioprio() is a helper function for filesystems to adjust the
> + * bio's ioprio by calculating the content's activity as measured by MGLRU.
> + * The filesystem should call this function before submit_bio for buffered
> + * read/write/sync.
> + */
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +void bio_set_active_ioprio(struct bio *bio)
> +{
> +	struct bio_vec bv;
> +	struct bvec_iter iter;
> +	struct page *page;
> +	int class, level, hint;
> +	int activity = 0;
> +	int cnt = 0;
> +
> +	class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +	level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +	hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +	/* apply the legacy ioprio policy to RT tasks */
> +	if (task_is_realtime(current)) {
> +		bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(IOPRIO_CLASS_RT, level, hint);
> +		return;
> +	}
> +	bio_for_each_bvec(bv, bio, iter) {
> +		page = bv.bv_page;
> +		activity += PageWorkingset(page) ? 1 : 0;
> +		cnt++;
> +		if (activity > bio->bi_vcnt / 2) {
> +			class = IOPRIO_CLASS_RT;
> +			break;
> +		} else if (activity > bio->bi_vcnt / 4) {
> +			/*
> +			 * all iterated pages are active so far,
> +			 * so raise to RT directly
> +			 */
> +			if (activity == cnt) {
> +				class = IOPRIO_CLASS_RT;
> +				break;
> +			} else
> +				class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()),
> +					    IOPRIO_CLASS_BE);
> +		}
> +	}
> +	if (!class && activity > cnt / 2)
> +		class = IOPRIO_CLASS_RT;
> +	else if (!class && activity > cnt / 4)
> +		class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> +	bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(class, level, hint);
> +}
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio) {}
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page) {}
> +#endif
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO_ADD_PAGE
> +/*
> + * bio_set_active_ioprio_page/folio are helper functions for counting
> + * the bio's activity during each call. However, they cannot handle the
> + * bvec_try_merge_page scenario. The submitter can use them if there
> + * is no such case in the system (block size < page size).
> + */
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page)
> +{
> +	int class, level, hint;
> +
> +	class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +	level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +	hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +	bio->bi_cont_act += PageWorkingset(page) ? 1 : 0;
> +
> +	if (bio->bi_cont_act > bio->bi_vcnt / 2)
> +		class = IOPRIO_CLASS_RT;
> +	else if (bio->bi_cont_act > bio->bi_vcnt / 4)
> +		class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> +	bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(class, level, hint);
> +}
> +
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio)
> +{
> +	bio_set_active_ioprio_page(bio, &folio->page);
> +}
> +void bio_set_active_ioprio(struct bio *bio) {}
> +#endif
> +#else
> +void bio_set_active_ioprio(struct bio *bio) {}
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page) {}
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio) {}
> +#endif
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio);
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio_page);
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio_folio);
> +
>  /*
>   * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
>   * If they are, then fine.  If, however, some pages are clean then they must
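
For reference, the promotion heuristic above can be restated in isolation as
a simplified sketch (illustrative only; it drops the early-exit shortcuts and
the max() against the submitter's current class that the patch uses):

#include <linux/ioprio.h>

/*
 * More than half of the pages active -> RT; more than a quarter -> BE;
 * otherwise keep whatever class the bio already had.
 */
static int example_promoted_class(int cur_class, int active, int total)
{
	if (total && active > total / 2)
		return IOPRIO_CLASS_RT;
	if (total && active > total / 4)
		return IOPRIO_CLASS_BE;
	return cur_class;
}
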
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 41d417ee1349..35221ee3dd54 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -487,6 +487,9 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter);
>  void __bio_release_pages(struct bio *bio, bool mark_dirty);
>  extern void bio_set_pages_dirty(struct bio *bio);
>  extern void bio_check_pages_dirty(struct bio *bio);
> +extern void bio_set_active_ioprio(struct bio *bio);
> +extern void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio);
> +extern void bio_set_active_ioprio_page(struct bio *bio, struct page *page);
>
>  extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
>  			       struct bio *src, struct bvec_iter *src_iter);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index d5c5e59ddbd2..a3a18b9a5168 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -314,7 +314,12 @@ struct bio {
>  	struct bio_vec		*bi_io_vec;	/* the actual vec list */
>
>  	struct bio_set		*bi_pool;
> -
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO
> +	/*
> +	 * bi_cont_act records the total activity of the bi_io_vec pages
> +	 */
> +	u64 bi_cont_act;
> +#endif
>  	/*
>  	 * We can inline a number of vecs at the end of the bio, to avoid
>  	 * double allocations for a small number of bio_vecs. This member
> --
> 2.25.1
>