On 2/10/21 2:42 AM, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
>> Doing a work per cmd can lead to lots of threads being created.
>> This patch just replaces the completion work per cmd with a per cpu
>> list. Combined with the first patches this allows tcm loop on top of
>> initiators like iser to go from around 700K IOPs to 1000K and reduces
>> the number of threads that get created when the system is under heavy
>> load and hitting the initiator drivers tagging limits.
>
> OTOH it does increase completion latency, which might be the preference
> for some workloads. Do we need a tunable here?

I didn't see an increase with higher perf setups like 100G (really 60
something) iser/srp, and even using ram disks. I think it's because
either way we run one cmd on each cpu at a time, or, if we do kick off
cmds in different threads, they still run on the same cpu, so latency
didn't change overall.

1. Without my patch each cmd has a work struct. target_complete_cmd
does queue_work_on, which puts the cmd's work on the worker pool for
that cpu. With faster drivers, workqueue.c does the equivalent of a
loop over each work/cmd: running one work on the cpu, letting it
complete, then running the next.

If we get to the point where the workqueue code runs cmds in different
threads on the same cpu, they end up sharing the cpu, so latency wasn't
changing (I didn't debug this very much).

I can't tell how much the rescuer thread comes into this though. Maybe
it could come in and run cmds on a different cpu, and that would help.

2. With the patch, we add the cmd to the device's per cpu list, then do
queue_work_on on the list's per cpu work. queue_work_on then either
hits the pending check, or we run like #1 but on a per device basis
instead of per cmd.

However:

1. While I was checking this I noticed that for commands that set
transport_complete_callback, I need to do the per cmd work, since they
can be slow. I'll fix this in the next posting.

2.
For the latency issue, did you want the cmd to be able to run on any
cpu? I think this won't happen now under normal cases, because the
workqueue does not set WQ_UNBOUND. Did we want a tunable for that?

>> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
>> +				  int cpu, struct workqueue_struct *wq)
>>  {
>> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
>> +	queue_work_on(cpu, wq, &q->work);
>> +}
>
> Do we need this helper at all? Having it open coded in the two callers
> would seem easier to follow to me.

I can do that.