On 2/10/21 2:42 AM, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
>> Doing a work per cmd can lead to lots of threads being created.
>> This patch just replaces the completion work per cmd with a per cpu
>> list. Combined with the first patches this allows tcm loop on top of
>> initiators like iser to go from around 700K IOPs to 1000K and reduces
>> the number of threads that get created when the system is under heavy
>> load and hitting the initiator drivers tagging limits.
>
> OTOH it does increase completion latency, which might be the preference
> for some workloads. Do we need a tunable here?

I didn't see an increase with higher perf setups like 100G (really 60
something) iser/srp, and even using ram disks. I think it's because
either way we run one cmd on each cpu at a time, or, if we do kick off
cmds in different threads, they still run on the same cpu, so latency
didn't change overall.

1. Without my patch each cmd has a work struct. target_complete_cmd
does queue_work_on, which puts the cmd's work on the worker pool for
that cpu. With faster drivers, workqueue.c does the equivalent of a
loop over each work/cmd: running one work on the cpu, letting it
complete, then running the next.

If we get to the point where the workqueue code runs cmds in different
threads on the same cpu, they end up sharing the cpu, so latency wasn't
changing (I didn't debug this very much).

I can't tell how much the rescuer thread comes into this though. Maybe
it could come in and run cmds on a different cpu, and that would help.

2. With the patch, we add the cmd to the device's per cpu list, then do
queue_work_on on the list's per cpu work. queue_work_on then either
hits the pending check, or we run like #1 but on a per device basis
instead of per cmd.

However:

1. While I was checking this I noticed that for commands that set
transport_complete_callback, I need to do the per cmd work, since they
can be slow. I'll fix this in the next posting.

2.
For the latency issue, did you want the cmd to be able to run on any
cpu? I think this won't happen now under normal cases, because the
workqueue does not set WQ_UNBOUND. Did we want a tunable for that?

>> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
>> +				  int cpu, struct workqueue_struct *wq)
>>  {
>> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
>> +	queue_work_on(cpu, wq, &q->work);
>> +}
>
> Do we need this helper at all? Having it open coded in the two callers
> would seem easier to follow to me.

I can do that.