So I tested this on hisi_sas with 12x SAS SSDs, and performance with
"mq-deadline" is comparable with "none" at ~2M IOPS. But after a while
performance drops a lot, to maybe 700K IOPS. Do you have a similar
experience?
I am using mq-deadline only for HDDs. I have not tried it on SSDs since
it is not a useful scheduler for SSDs.
I ask as I only have SAS SSDs to test.
I noticed that when I used mq-deadline, the performance drop starts once
I have a larger number of drives.
I am running an fio script which has 64 drives and 64 threads, and all
threads are bound to the local NUMA node, which has 36 logical cores.
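For reference, a job of roughly that shape would look something like the
sketch below. This is illustrative only, not the actual script; the
device names, iodepth and runtime are placeholders:

[global]
; all jobs pinned to the local NUMA node's 36 logical CPUs
cpus_allowed=0-35
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=32
runtime=60
time_based
group_reporting

; one such section per drive, 64 in total
[drive00]
filename=/dev/sda
numjobs=1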
I noticed that the lock contention is in dd_dispatch_request(). I am not
sure why there is no penalty from the same lock in nr_hw_queues = 1
mode.
So this could be just a pre-existing issue of exposing multiple queues
for SCSI HBAs combined with the mq-deadline iosched. I mean, that's
really the only significant change in this series, apart from the shared
sbitmap, and, at this point, I don't think that is the issue.
As an experiment, I modified the mainline hisi_sas driver to expose hw
queues and manage tags itself, and I see the same issue I mentioned:
Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0 iops] [eta
Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0 iops] [eta
Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0 iops] [eta
Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0 iops] [eta
Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0 iops] [eta
Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0 iops] [eta 0
Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0 iops] [eta 0
Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0 iops] [eta 0
Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0 iops] [eta 0
The odd time I see "sched: RT throttling activated" around the time the
throughput falls. I think the issue is the per-queue threaded irq
handlers consuming too many cycles. With the "none" io scheduler, IOPS
is flat at around 2M.
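For context on why RT throttling shows up at all here: handlers
registered via request_threaded_irq() run in per-interrupt kernel
threads scheduled as SCHED_FIFO, and it is those threads that the RT
throttle (kernel.sched_rt_runtime_us, 950ms per 1s period by default)
clamps once they eat too much CPU. A minimal sketch of that registration
pattern follows; the cq_* names and structure are illustrative only, not
the actual hisi_sas code:

#include <linux/interrupt.h>

/* Illustrative per-completion-queue context; stands in for whatever
 * the driver actually passes to its handlers.
 */
struct cq_ctx {
	int irq_nr;
};

static irqreturn_t cq_hard_irq(int irq, void *p)
{
	/* quick ack in hard-irq context, defer the real work */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t cq_thread_fn(int irq, void *p)
{
	/* completion processing runs here, in an "irq/<nr>-..." kthread
	 * with SCHED_FIFO policy; this is what RT throttling acts on
	 */
	return IRQ_HANDLED;
}

static int cq_setup_irq(struct cq_ctx *cq)
{
	return request_threaded_irq(cq->irq_nr, cq_hard_irq, cq_thread_fn,
				    IRQF_ONESHOT, "percq-compl", cq);
}

If that is what is happening, the per-irq kthreads should show up near
the top of the profile, and raising sched_rt_runtime_us would only move
the cliff rather than remove it.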
static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock(&dd->lock);

So if multiple hctx's are accessing this lock, then much contention is
possible.

	rq = __dd_dispatch_request(dd);
	spin_unlock(&dd->lock);

	return rq;
}
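Note that dd->lock is per request_queue (it lives in the deadline_data
allocated when the elevator is initialised), not per hctx, which would
explain the asymmetry: with nr_hw_queues = 1 only one dispatch context
ever takes it, while with multiple hw queues every hctx's dispatch work
funnels into the same spinlock. A deliberately simplified sketch of that
fan-in, with made-up names, just to illustrate the shape (this is not
block-layer code):

#include <linux/spinlock.h>

/* Stand-in for deadline_data: one lock per request_queue. */
struct toy_elevator_data {
	spinlock_t lock;
};

/* What each hctx's dispatch effectively does. */
static void toy_dispatch_one_hctx(struct toy_elevator_data *dd)
{
	spin_lock(&dd->lock);
	/* pick the next request off the shared sort/fifo lists */
	spin_unlock(&dd->lock);
}

/* With nr_hw_queues == 1 the loop body only ever runs from a single
 * context; with many hw queues each iteration is typically a different
 * kworker, so they all contend on the one dd->lock instead of
 * dispatching in parallel.
 */
static void toy_run_all_hw_queues(struct toy_elevator_data *dd,
				  int nr_hw_queues)
{
	int i;

	for (i = 0; i < nr_hw_queues; i++)
		toy_dispatch_one_hctx(dd);
}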
Here is the perf report:
- 1.04% 0.99% kworker/18:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
   0.99% ret_from_fork
   - kthread
     - worker_thread
       - 0.98% process_one_work
         - 0.98% __blk_mq_run_hw_queue
           - blk_mq_sched_dispatch_requests
             - 0.98% blk_mq_do_dispatch_sched
               - 0.97% dd_dispatch_request
                 + 0.97% queued_spin_lock_slowpath
+ 1.04% 0.00% kworker/18:1H+k [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 1.03% 0.95% kworker/19:1H-k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
+ 1.03% 0.00% kworker/19:1H-k [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 1.02% 0.97% kworker/20:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
+ 1.02% 0.00% kworker/20:1H+k [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 1.01% 0.96% kworker/21:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
I'll try to capture a perf report and compare to mine.
Mine is spending a huge amount of time (circa 33% on a CPU servicing
completion irqs) in mod_delayed_work_on():
--79.89%--sas_scsi_task_done
|--76.72%--scsi_mq_done
| |
| --76.53%--blk_mq_complete_request
| |
| |--74.81%--scsi_softirq_done
| | |
| | --73.91%--scsi_finish_command
| | |
| | |--72.11%--scsi_io_completion
| | | |
| | | --71.89%--scsi_end_request
| | | |
| | | |--40.82%--blk_mq_run_hw_queues
| | | | |
| | | | |--35.86%--blk_mq_run_hw_queue
| | | | | |
| | | | | --33.59%--__blk_mq_delay_run_hw_queue
| | | | | |
| | | | | --33.38%--kblockd_mod_delayed_work_on
| | | | | |
| | | | | --33.31%--mod_delayed_work_on
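For reference, the tail of that path is just the block layer punting the
hctx run to the kblockd workqueue: kblockd_mod_delayed_work_on() is a
thin wrapper that re-arms hctx->run_work via mod_delayed_work_on().
Roughly what the wrapper does (paraphrased from memory of
block/blk-core.c, so it may not match the exact tree here):

/* Queue (or re-arm) a delayed work item on the block layer's internal
 * kblockd workqueue. blk_mq_run_hw_queues() on the completion path ends
 * up here once per hctx, which is why mod_delayed_work_on() dominates
 * the profile when a device exposes many hw queues.
 */
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
				unsigned long delay)
{
	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
}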
hmmmm...
Thanks,
John