On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
> Doing a work per cmd can lead to lots of threads being created.
> This patch just replaces the completion work per cmd with a per cpu
> list. Combined with the first patches this allows tcm loop on top of
> initiators like iser to go from around 700K IOPs to 1000K and reduces
> the number of threads that get created when the system is under heavy
> load and hitting the initiator drivers tagging limits.

OTOH it does increase completion latency, which might be the preference
for some workloads. Do we need a tunable here?

> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
> +				   int cpu, struct workqueue_struct *wq)
>  {
> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
> +	queue_work_on(cpu, wq, &q->work);
> +}

Do we need this helper at all? Having it open coded in the two callers
would seem easier to follow to me.
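
For reference, a minimal sketch of the queue/drain pattern the patch is
adding: one work item per cpu drains an llist of commands instead of one
work item per command. The struct and function names follow the quoted
hunk; complete_one_cmd() is a hypothetical stand-in for the real
completion path, and the llist_reverse_order() step is just one way to
preserve submission order, not necessarily what the series does.

	#include <linux/kernel.h>
	#include <linux/llist.h>
	#include <linux/workqueue.h>

	struct se_cmd_queue {
		struct llist_head	cmd_list;
		struct work_struct	work;
	};

	struct se_cmd {
		struct llist_node	se_cmd_list;
		/* ... */
	};

	/* Placeholder: the real code would invoke the fabric completion hook. */
	static void complete_one_cmd(struct se_cmd *cmd)
	{
	}

	/* Producer side: add the cmd to the per-cpu list and kick its work. */
	static void target_queue_cmd_work(struct se_cmd_queue *q,
					  struct se_cmd *se_cmd, int cpu,
					  struct workqueue_struct *wq)
	{
		llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
		queue_work_on(cpu, wq, &q->work);
	}

	/* Work side: drain the whole batch from one work item. */
	static void target_completion_work(struct work_struct *work)
	{
		struct se_cmd_queue *q = container_of(work, struct se_cmd_queue,
						      work);
		struct llist_node *node;
		struct se_cmd *cmd, *next;

		/* llist_del_all() hands back newest-first; flip to FIFO order. */
		node = llist_reverse_order(llist_del_all(&q->cmd_list));
		llist_for_each_entry_safe(cmd, next, node, se_cmd_list)
			complete_one_cmd(cmd);
	}

Open coding the helper, as suggested above, would simply mean each of the
two callers does the llist_add() + queue_work_on() pair itself with its
own queue, cpu and workqueue.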