On 11/7/22 02:21, Daisuke Matsuda (Fujitsu) wrote:
> On Sun, Nov 6, 2022 6:15 AM Bob Pearson wrote:
>> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
>>> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>>>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>>>> This patch series implements work queues as an alternative for
>>>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>>>> with a patch that makes the internal API for task execution pluggable
>>>>>> and implements an inline and a tasklet based set of functions.
>>>>>> The remaining patches clean up the qp reset and error code in the
>>>>>> three tasklets and modify the locking logic to prevent making
>>>>>> multiple calls to the tasklet scheduling routine. After
>>>>>> this preparation the work queue equivalent set of functions is
>>>>>> added and the tasklet version is dropped.
>>>>>
>>>>> Thank you for posting the 3rd series.
>>>>> It looks fine at a glance, but now I am concerned about problems
>>>>> that could potentially be caused by concurrency.
>>>>>
>>>>>>
>>>>>> The main advantage of the work queue version of deferred task execution
>>>>>> is that it has much better scalability
>>>>>> and overall performance than the tasklet variant. The perftest
>>>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>>>> about 16Gb/sec for tasklets.
>>>>>
>>>>> As you wrote, the advantage of the work queue version is that the number of works
>>>>> that can run in parallel scales with the number of logical CPUs. However, the
>>>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) were
>>>>> designed for serial execution on tasklets, so we must not rely on them functioning
>>>>> properly when executed in parallel.
>>>>
>>>> Work queues are serial for each separate work task just like tasklets. There isn't
>>>> a problem here.
>>>> The tasklets for different tasks can run in parallel but tend to
>>>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>>>> default on the same cpu as the thread that scheduled them, while work queue tasks
>>>> are scheduled by the kernel scheduler and get spread around.
>>>
>>> =====
>>> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
>>> =====
>>> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
>>> the system scheduler, but each work is still enqueued to the worker pool of a single CPU
>>> and thus bound to the CPU the issuer is running on. It seems the behaviour you
>>> expect can be achieved with the WQ_UNBOUND flag. Unbound work items will run
>>> on any CPU at the cost of cache utilization.
>>>
>>> Two instances of the same tasklet never run concurrently on two different processors
>>> by nature, but that is not the case with work queues. If two softirqs running on
>>> different CPUs enqueue responder works at almost the same time, it is possible that
>>> they are dispatched and run on different CPUs at the same time. I mean that problems
>>> may arise in such a situation.
>>>
>>> Please let me know if I missed anything. I referred to the following document.
>>> The easiest solution is to use @flags=WQ_UNBOUND and @max_active=1 to let works
>>> run serially.
>>> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
>>>
>>> Thanks,
>>> Daisuke
>>>
>> According to this document:
>>
>> Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>> after a work item gets queued:
>>
>> The work function hasn't been changed.
>>
>> No one queues the work item to another workqueue.
>>
>> The work item hasn't been reinitiated.
>>
>> In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>> most one worker system-wide at any given time.
>>
>> Note that requeuing the work item (to the same queue) in the self function doesn't break these
>> conditions, so it's safe to do. Otherwise, caution is required when breaking the conditions
>> inside a work function.
>>
>> So it should be OK. Each work item checks the state under lock before scheduling the item and,
>> if it is free, moves it to busy and then schedules it. Only one instance of a work item
>> at a time should be running.
>
> Thank you for the explanation.
> Per-qp work items should meet the three conditions. That is what I was missing.
> Now I see. You are correct.
>
>>
>> I only know what I see from running top. It seems that the work items do get spread out over
>> time across the cpus.
>
> It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
> workers in the same way. This is not stated explicitly in the document.
>
>>
>> The CPU_INTENSIVE flag is certainly correct for our application, which will run all the cpus
>> at 100% for extended periods of time. We are benchmarking storage with IOR.
>
> It is OK with me. I have not come up with any situation where the CPU_INTENSIVE
> flag bothers other rxe users.
>
> Thanks,
> Daisuke
>
>>
>> Bob
>>
>>>>>
>>>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>>>> executed in the same order the packets are received. Works are enqueued to worker
>>>>> pools on each CPU, and each CPU schedules its works independently, so the ordering
>>>>> of works among CPUs is not guaranteed.
>>>>>
>>>>> [1]
>>>>> On UC/UD connections, the responder does not check the psn of inbound packets,
>>>>> so the payloads can be copied to MRs without checking the order. If there are
>>>>> works that write to overlapping memory locations, they can potentially cause
>>>>> data corruption depending on the order.
>>>>>
>>>>> [2]
>>>>> On RC connections, the responder checks the psn, and drops the packet if it is not
>>>>> the expected one.
>>>>> The requester can retransmit the request in this case, so the order
>>>>> seems to be guaranteed for RC.
>>>>>
>>>>> However, the responder updates the next expected psn (qp->resp.psn) BEFORE
>>>>> replying with an ACK packet. If the work is preempted soon after storing the next psn,
>>>>> another work on another CPU can potentially reply with another ACK packet earlier.
>>>>> This behaviour is against the spec.
>>>>> Cf. IB Specification Vol 1-Release-1.5 "9.5 TRANSACTION ORDERING"
>>>>>
>>>>> [3]
>>>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>>>> loaded and stored at the same time from different threads. It seems we
>>>>> have to use a synchronization method, perhaps the READ_ONCE() and
>>>>> WRITE_ONCE() macros, to prevent loading a stale value. This one is just an
>>>>> example; there can be other variables that need similar consideration.
>>>>>
>>>>>
>>>>> All the problems above can be solved by making the work queue single-
>>>>> threaded. We can do this with flags=WQ_UNBOUND and max_active=1
>>>>> for alloc_workqueue(), but this should be the last resort since it spoils
>>>>> the performance benefit of work queues.
>>>>>
>>>>> I am not sure what we can do about [1] right now.
>>>>> For [2] and [3], we could just move the update of the psn to after the ack reply,
>>>>> and use the *_ONCE() macros for shared variables.
>>>>>
>>>>> Thanks,
>>>>> Daisuke

Thank you for taking the time to review this.

Bob

>>>>>
>>>>>>
>>>>>> This version of the patch series drops the tasklet version as an option
>>>>>> but keeps the option of switching between the workqueue and inline
>>>>>> versions.
>>>>>>
>>>>>> This patch series is derived from an earlier patch set developed by
>>>>>> Ian Ziemba at HPE, which is used in some Lustre storage clients attached
>>>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>>>
>>>>>> It is based on the current version of wip/jgg-for-next.
>>>>>>
>>>>>> v3:
>>>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@xxxxxxxxx/
>>>>>> The v3 version drops the first few patches, which have already been accepted
>>>>>> in for-next. It also drops the last patch of the v2 version, which
>>>>>> introduced module parameters to select between the task interfaces, and it
>>>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>>>> the kernel test robot <lkp@xxxxxxxxx>: a missing static declaration.
>>>>>>
>>>>>> v2:
>>>>>> The v2 version of the patch set has some minor changes that address
>>>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>>>> and the setup parameters for alloc_workqueue. It also has one
>>>>>> additional cleanup patch.
>>>>>>
>>>>>> Bob Pearson (13):
>>>>>>   RDMA/rxe: Make task interface pluggable
>>>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>>>   RDMA/rxe: Make tasks schedule each other
>>>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>>>
>>>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>>>> --
>>>>>> 2.34.1
>>>>>
>>>
>