Hi Hou Tao,

On Wed, Jun 03, 2020 at 03:39:31PM +0800, Hou Tao wrote:
> When there are many free-bit waiters, the current batch wakeup method will
> wake up at most wake_batch processes when wake_batch bits are freed.
> The perfect result is that each process gets a free bit; however, the
> real result is that a woken-up process may be unable to get a free bit
> and will call io_schedule() multiple times. That's because other
> processes (e.g. ones woken up earlier) in the same wake-up batch may
> have already allocated multiple free bits.
>
> And the race leads to two problems. The first one is unnecessary
> context switches, because multiple processes are woken up and then go
> to sleep again afterwards. The second one is performance degradation
> when there is spatial locality between requests from one process
> (e.g. split IO for HDD), because one process can not allocate requests
> continuously for the split IOs, and the sequential IOs will be
> dispatched separately.

I guess this way is a bit worse for HDD, since sequential IO may be
interrupted by other contexts.

>
> To fix the problem, we mimic the way SQ handles this situation:

Do you mean the SQ way is the congestion control code in __get_request()?
If not, could you provide more background on how SQ handles this issue?
Because it isn't easy for me to associate your approach with SQ's code.

> 1) stash a bulk of free bits
> 2) wake up a process when a new bit is freed
> 3) woken-up processes consume the stashed free bits
> 4) when the stashed free bits are exhausted, go to step 1)
>
> Because the tag allocation path or io submit path is much faster than
> the tag free path, when the race for free tags is intensive,

Indeed, I guess you mean bio_endio is slow.

> we can ensure:
> 1) only a few processes will be woken up and will exhaust the stashed
> free bits quickly.
> 2) these processes will be able to allocate multiple requests
> continuously.
>
> An alternative fix is to dynamically adjust the number of woken-up
> processes according to the number of waiters and busy bits, instead of
> using wake_batch each time in __sbq_wake_up(). However, that would need
> to record the number of busy bits all the time, so the stash-wake-use
> method is used instead.
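Also, just to check that I understand the stash-wake-use flow correctly,
below is a tiny userspace model of steps 1)-4) as I read them from the
commit log. It is not taken from your patch: the lib/sbitmap.c hunk isn't
quoted in this reply, so apart from the names stash_ready/stashed_bits
(borrowed from the blk-mq-tag.c hunk further down), every field and
helper here is my own guess.

/* stash_model.c: toy model of the stash-wake-use idea, reconstructed
 * from the commit log only; field and helper names are mine, not the
 * ones in the real lib/sbitmap.c change.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct stash {
        unsigned int wake_batch;        /* size of one stashed bulk */
        atomic_bool stash_ready;        /* a bulk is ready for consumers */
        atomic_int stashed_bits;        /* bits left in the current bulk */
        atomic_int freed;               /* bits freed while no bulk is ready */
};

/* userspace stand-in for the kernel's atomic_dec_if_positive() */
static int dec_if_positive(atomic_int *v)
{
        int old = atomic_load(v);

        while (old > 0 && !atomic_compare_exchange_weak(v, &old, old - 1))
                ;
        return old - 1;
}

/* free path, steps 1) and 2): build a bulk, then wake one waiter per bit */
static void model_free_bit(struct stash *s)
{
        if (!atomic_load(&s->stash_ready)) {
                /* 1) accumulate freed bits until a whole bulk is stashed */
                if (atomic_fetch_add(&s->freed, 1) + 1 >= (int)s->wake_batch) {
                        atomic_store(&s->freed, 0);
                        atomic_store(&s->stashed_bits, (int)s->wake_batch);
                        atomic_store(&s->stash_ready, true);
                }
        } else {
                /* 2) each newly freed bit wakes exactly one waiter */
                puts("wake_up() one waiter");
        }
}

/* allocation path, steps 3) and 4): consume the stash, restart when empty */
static void model_consume_stashed_bit(struct stash *s)
{
        if (atomic_load(&s->stash_ready) && !dec_if_positive(&s->stashed_bits))
                atomic_store(&s->stash_ready, false);   /* back to step 1) */
}

int main(void)
{
        struct stash s = { .wake_batch = 4, .stash_ready = false,
                           .stashed_bits = 0, .freed = 0 };
        int i;

        /* first 4 frees build the bulk, the next 4 each wake one waiter */
        for (i = 0; i < 8; i++)
                model_free_bit(&s);
        /* 4 consumers drain the bulk, the last one clears stash_ready */
        for (i = 0; i < 4; i++)
                model_consume_stashed_bit(&s);
        printf("stash_ready=%d\n", (int)atomic_load(&s.stash_ready));
        return 0;
}

Is that roughly what the lib/sbitmap.c side of the patch does?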
> The following is the result of a simple fio test:
>
> 1. fio (random read, 1MB, libaio, iodepth=1024)
>
> (1) 4TB HDD (max_sectors_kb=256)
>
> IOPS (bs=1MB)
> jobs | 4.18-sq | 5.6.15 | 5.6.15-patched |
>   1  |   120   |  120   |      119
>  24  |   120   |  105   |      121
>  48  |   122   |  102   |      121
>  72  |   120   |  100   |      119
>
> context switches per second
> jobs | 4.18-sq | 5.6.15 | 5.6.15-patched |
>   1  |  1058   |  1162  |     1188
>  24  |  1047   |  1715  |     1105
>  48  |  1109   |  1967  |     1105
>  72  |  1084   |  1908  |     1106
>
> (2) 1.8TB SSD (set max_sectors_kb=256)
>
> IOPS (bs=1MB)
> jobs | 4.18-sq | 5.6.15 | 5.6.15-patched |
>   1  |  1077   |  1075  |     1076
>  24  |  1079   |  1075  |     1076
>  48  |  1077   |  1076  |     1076
>  72  |  1077   |  1076  |     1077
>
> context switches per second
> jobs | 4.18-sq | 5.6.15 | 5.6.15-patched |
>   1  |  1833   |  5123  |     5264
>  24  |  2143   | 15238  |     3859
>  48  |  2182   | 19015  |     3617
>  72  |  2268   | 19050  |     3662
>
> (3) 1.5TB NVMe (set max_sectors_kb=256)
>
> 4 read queues, 72 CPUs
>
> IOPS (bs=1MB)
> jobs | 5.6.15 | 5.6.15-patched |
>   1  |  3018  |     3018
>  18  |  3015  |     3016
>  36  |  3001  |     3005
>  54  |  2993  |     2997
>  72  |  2984  |     2990
>
> context switches per second
> jobs | 5.6.15 | 5.6.15-patched |
>   1  |  6292  |     6469
>  18  | 19428  |     4253
>  36  | 21290  |     3928
>  54  | 23060  |     3957
>  72  | 24221  |     4054
>
> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> ---
> Hi,
>
> We found these problems (excessive context switches and a slight
> performance degradation) during a performance comparison between
> blk-sq (4.18) and blk-mq (5.6) on HDD, but we could not find a better
> way to fix them.
>
> It seems that in order to implement batched request allocation for a
> single process, we need to use an atomic variable to track the number
> of busy bits. That is suitable for HDD or SSD, because the IO latency
> is greater than 1ms, but we are not sure whether it is OK for NVMe
> devices.

Do you have a benchmark on NVMe/SSD with 4k BS?

>
> Suggestions and comments are welcome.
>
> Regards,
> Tao
> ---
>  block/blk-mq-tag.c      |  4 ++++
>  include/linux/sbitmap.h |  7 ++++++
>  lib/sbitmap.c           | 49 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 60 insertions(+)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..fd601fa6c684 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>          sbitmap_finish_wait(bt, ws, &wait);
>  
>  found_tag:
> +        if (READ_ONCE(bt->stash_ready) &&
> +            !atomic_dec_if_positive(&bt->stashed_bits))
> +                WRITE_ONCE(bt->stash_ready, false);
> +

Is it doable to move the above code into sbitmap_queue_do_stash_and_wake_up(),
after wake_up(&ws->wait)?

Or should it at least only be done for a successful allocation after a
context switch?
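For the first option, I mean something like the sketch below. It is only
to illustrate where the lines would go: since the lib/sbitmap.c part of
your patch isn't quoted here, everything except the three lines moved
from the hunk above (including the use of sbq_wake_ptr()) is my guess at
the existing body, not your actual code.

/* lib/sbitmap.c: illustrative only; apart from the moved stash_ready/
 * stashed_bits lines, the body of this function is guessed.
 */
void sbitmap_queue_do_stash_and_wake_up(struct sbitmap_queue *sbq)
{
        struct sbq_wait_state *ws = sbq_wake_ptr(sbq);

        if (!ws)
                return;

        wake_up(&ws->wait);

        /* moved here from the found_tag path in blk_mq_get_tag() */
        if (READ_ONCE(sbq->stash_ready) &&
            !atomic_dec_if_positive(&sbq->stashed_bits))
                WRITE_ONCE(sbq->stash_ready, false);
}

That way the stash accounting would stay inside sbitmap and
blk_mq_get_tag() wouldn't need to touch the stash at all.

Thanks,
Ming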