On 9/25/20 8:39 PM, Xianting Tian wrote:
> We found blk_mq_alloc_rq_maps() takes more time in kernel space when
> testing nvme device hot-plugging. The test and analysis are below.
>
> Debug code,
> 1, blk_mq_alloc_rq_maps():
>         u64 start, end;
>         depth = set->queue_depth;
>         start = ktime_get_ns();
>         pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
>                 current->pid, current->comm, current->nvcsw, current->nivcsw,
>                 set->queue_depth, set->nr_hw_queues);
>         do {
>                 err = __blk_mq_alloc_rq_maps(set);
>                 if (!err)
>                         break;
>
>                 set->queue_depth >>= 1;
>                 if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
>                         err = -ENOMEM;
>                         break;
>                 }
>         } while (set->queue_depth);
>         end = ktime_get_ns();
>         pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
>                 current->pid, current->comm,
>                 current->nvcsw, current->nivcsw, end - start);
>
> 2, __blk_mq_alloc_rq_maps():
>         u64 start, end;
>         for (i = 0; i < set->nr_hw_queues; i++) {
>                 start = ktime_get_ns();
>                 if (!__blk_mq_alloc_rq_map(set, i))
>                         goto out_unwind;
>                 end = ktime_get_ns();
>                 pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
>         }
>
> Testing nvme hot-plugging with the above debug code, we found it costs more
> than 3ms in total in kernel space, without being scheduled out, to allocate
> rqs for all 16 hw queues with depth 1023; each hw queue costs about
> 140-250us. The cost grows as the number of hw queues and the queue depth
> increase. And in an extreme case, if __blk_mq_alloc_rq_maps() returns
> -ENOMEM, it retries with "queue_depth >>= 1", so even more time is consumed.
> [ 428.428771] nvme nvme0: pci function 10000:01:00.0
> [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
> [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
> [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
> [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
> [ 432.593404] hw queue 0 init cost time 22883 ns
> [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
> [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
> [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
> [ 432.596203] hw queue 0 init cost time 242630 ns
> [ 432.596441] hw queue 1 init cost time 235913 ns
> [ 432.596659] hw queue 2 init cost time 216461 ns
> [ 432.596877] hw queue 3 init cost time 215851 ns
> [ 432.597107] hw queue 4 init cost time 228406 ns
> [ 432.597336] hw queue 5 init cost time 227298 ns
> [ 432.597564] hw queue 6 init cost time 224633 ns
> [ 432.597785] hw queue 7 init cost time 219954 ns
> [ 432.597937] hw queue 8 init cost time 150930 ns
> [ 432.598082] hw queue 9 init cost time 143496 ns
> [ 432.598231] hw queue 10 init cost time 147261 ns
> [ 432.598397] hw queue 11 init cost time 164522 ns
> [ 432.598542] hw queue 12 init cost time 143401 ns
> [ 432.598692] hw queue 13 init cost time 148934 ns
> [ 432.598841] hw queue 14 init cost time 147194 ns
> [ 432.598991] hw queue 15 init cost time 148942 ns
> [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
> [ 432.602611] nvme0n1: p1
>
> So use this patch to trigger a schedule between each hw queue init, to
> avoid other threads getting stuck. We are not in atomic context when
> executing __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Applied for 5.10, thanks.

-- 
Jens Axboe
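For reference, the change described above (a cond_resched() between hw queue
inits in __blk_mq_alloc_rq_maps()) would amount to something like the sketch
below; the exact hunk context and line placement are an assumption, not a
quote of the applied patch:

```diff
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		if (!__blk_mq_alloc_rq_map(set, i))
 			goto out_unwind;
+		/* Allocating per-queue rq maps can take hundreds of us each;
+		 * give other runnable tasks a chance between iterations.
+		 */
+		cond_resched();
 	}
```

cond_resched() is a no-op on fully preemptible kernels but provides an
explicit preemption point on voluntary/none preemption models, which is
exactly where the 3ms+ stretch without a context switch was observed.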