On Sun, Nov 6, 2022 6:15 AM Bob Pearson wrote:
> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> > On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> >> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> >>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >>>> This patch series implements work queues as an alternative to
> >>>> the main tasklets in the rdma_rxe driver. The patch series starts
> >>>> with a patch that makes the internal API for task execution pluggable
> >>>> and implements an inline and a tasklet based set of functions.
> >>>> The remaining patches clean up the qp reset and error code in the
> >>>> three tasklets and modify the locking logic to prevent making
> >>>> multiple calls to the tasklet scheduling routine. After
> >>>> this preparation the work queue equivalent set of functions is
> >>>> added and the tasklet version is dropped.
> >>>
> >>> Thank you for posting the 3rd series.
> >>> It looks fine at a glance, but now I am concerned about problems
> >>> that could potentially be caused by concurrency.
> >>>
> >>>>
> >>>> The advantage of the work queue version of deferred task execution
> >>>> is mainly that the work queue variant has much better scalability
> >>>> and overall performance than the tasklet variant. The perftest
> >>>> microbenchmarks in local loopback mode (not a very realistic test
> >>>> case) can reach approximately 100Gb/sec with work queues compared to
> >>>> about 16Gb/sec for tasklets.
> >>>
> >>> As you wrote, the advantage of the work queue version is that the number
> >>> of works that can run in parallel scales with the number of logical CPUs.
> >>> However, the dispatched works (rxe_requester, rxe_responder, and
> >>> rxe_completer) are designed for serial execution on tasklets, so we must
> >>> not rely on them functioning properly under parallel execution.
> >>
> >> Work queues are serial for each separate work task just like tasklets. There isn't
> >> a problem here. The tasklets for different tasks can run in parallel but tend to
> >> do so less than work queue tasks. The reason is that tasklets are scheduled by
> >> default on the same cpu as the thread that scheduled them, while work queue items
> >> are scheduled by the kernel scheduler and get spread around.
> >
> > =====
> > rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> > =====
> > You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> > the system scheduler, but each work is still enqueued to the per-CPU worker
> > pools and is thus bound to the CPU the issuer is running on. It seems the
> > behaviour you expect can be achieved with the WQ_UNBOUND flag. Unbound work
> > items will run on any CPU at the cost of cache utilization.
> >
> > Two instances of the same tasklet never run concurrently on two different
> > processors by nature, but that is not the case with work queues. If two softirqs
> > running on different CPUs enqueue responder works at almost the same time, it is
> > possible that they are dispatched and run on different CPUs at the same time.
> > I mean that problems may arise in such a situation.
> >
> > Please let me know if I missed anything. I referred to the following document.
> > The easiest solution is to use @flags=WQ_UNBOUND and @max_active=1 to let works
> > run serially.
> > cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
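For reference, the single-threaded fallback suggested above would look roughly like
this; just an untested sketch of the alloc_workqueue() call, not a proposed patch:

=====
/* untested sketch: a single unbound worker (max_active = 1), so all rxe
 * work items are executed strictly one at a time
 */
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
=====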
> >
> > Thanks,
> > Daisuke
> >
> According to this:
>
> Workqueue guarantees that a work item cannot be re-entrant if the following
> conditions hold after a work item gets queued:
>
>   The work function hasn't been changed.
>
>   No one queues the work item to another workqueue.
>
>   The work item hasn't been reinitiated.
>
> In other words, if the above conditions hold, the work item is guaranteed to be
> executed by at most one worker system-wide at any given time.
>
> Note that requeuing the work item (to the same queue) in the self function doesn't
> break these conditions, so it's safe to do. Otherwise, caution is required when
> breaking the conditions inside a work function.
>
> It should be OK. Each work item checks the state under lock before scheduling the
> item and, if it is free, moves it to busy and then schedules it. Only one instance
> of a work item at a time should be running.

Thank you for the explanation. Per-qp work items should meet the three conditions.
That is what I was missing. Now I see. You are correct. (For illustration, a rough
sketch of that schedule-side check is appended at the end of this mail.)

>
> I only know what I see from running top. It seems that the work items do get
> spread out over time on the cpus.

It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
workers in the same way. This is not stated explicitly in the document.

>
> The CPU_INTENSIVE flag is certainly correct for our application, which will run
> all the cpus at 100% for extended periods of time. We are benchmarking storage
> with IOR.

It is OK with me. I have not come up with any situation where the CPU_INTENSIVE
flag bothers other rxe users.

Thanks,
Daisuke

>
> Bob
>
> >>>
> >>> There could be 3 problems, which stem from the fact that works are not
> >>> necessarily executed in the same order the packets are received. Works are
> >>> enqueued to worker pools on each CPU, and each CPU schedules its works
> >>> independently, so the ordering of works among CPUs is not guaranteed.
> >>>
> >>> [1]
> >>> On UC/UD connections, the responder does not check the psn of inbound packets,
> >>> so the payloads can be copied to MRs without checking the order. If there are
> >>> works that write to overlapping memory locations, they can potentially cause
> >>> data corruption depending on the order.
> >>>
> >>> [2]
> >>> On RC connections, the responder checks the psn and drops the packet if it is
> >>> not the expected one. The requester can retransmit the request in this case,
> >>> so the order seems to be guaranteed for RC.
> >>>
> >>> However, the responder updates the next expected psn (qp->resp.psn) BEFORE
> >>> replying with an ACK packet. If the work is preempted soon after storing the
> >>> next psn, another work on another CPU can potentially reply with another ACK
> >>> packet earlier. This behaviour is against the spec.
> >>> Cf. IB Specification Vol 1, Release 1.5, "9.5 TRANSACTION ORDERING"
> >>>
> >>> [3]
> >>> Again on RC connections, the next expected psn (qp->resp.psn) can be
> >>> loaded and stored at the same time from different threads. It seems we
> >>> have to use a synchronization method, perhaps the READ_ONCE() and
> >>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> >>> example; there can be other variables that need similar consideration.
> >>>
> >>>
> >>> All the problems above can be solved by making the work queue single-
> >>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> >>> for alloc_workqueue(), but this should be the last resort since it spoils
> >>> the performance benefit of the work queue.
> >>>
> >>> I am not sure what we can do with [1] right now.
> >>> For [2] and [3], we could just move the update of the psn to after the ack
> >>> reply, and use *_ONCE() macros for shared variables.
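For illustration only, the [2]/[3] idea above might look like the sketch below.
Only qp->resp.psn comes from this discussion; the function and the other names are
placeholders and do not reflect the actual rxe_resp.c code:

=====
/* Illustrative sketch: reply with the ACK first, then publish the next
 * expected psn with WRITE_ONCE() so a reader on another CPU never sees
 * the new psn before the corresponding ACK has gone out.
 * send_ack_sketch() and next_psn are placeholders.
 */
static void respond_then_advance_psn(struct rxe_qp *qp, u32 next_psn)
{
	send_ack_sketch(qp);			/* ack reply first */
	WRITE_ONCE(qp->resp.psn, next_psn);	/* then update the psn */
}

/* readers on other CPUs would pair this with READ_ONCE(qp->resp.psn) */
=====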
> >>>
> >>> Thanks,
> >>> Daisuke
> >>>
> >>>>
> >>>> This version of the patch series drops the tasklet version as an option
> >>>> but keeps the option of switching between the workqueue and inline
> >>>> versions.
> >>>>
> >>>> This patch series is derived from an earlier patch set developed by
> >>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >>>> to Lustre servers with hard RoCE v2 NICs.
> >>>>
> >>>> It is based on the current version of wip/jgg-for-next.
> >>>>
> >>>> v3:
> >>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@xxxxxxxxx/
> >>>> The v3 version drops the first few patches which have already been accepted
> >>>> in for-next. It also drops the last patch of the v2 version which
> >>>> introduced module parameters to select between the task interfaces. It also
> >>>> drops the tasklet version entirely. It fixes a minor error caught by
> >>>> the kernel test robot <lkp@xxxxxxxxx> with a missing static declaration.
> >>>>
> >>>> v2:
> >>>> The v2 version of the patch set has some minor changes that address
> >>>> comments from Leon Romanovsky regarding locking of the valid parameter
> >>>> and the setup parameters for alloc_workqueue. It also has one
> >>>> additional cleanup patch.
> >>>>
> >>>> Bob Pearson (13):
> >>>>   RDMA/rxe: Make task interface pluggable
> >>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>>>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>>>   RDMA/rxe: Remove __rxe_do_task()
> >>>>   RDMA/rxe: Make tasks schedule each other
> >>>>   RDMA/rxe: Implement disable/enable_task()
> >>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>>>   RDMA/rxe: Add workqueue support for tasks
> >>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>>>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>>>
> >>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>>>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>>>
> >>>>
> >>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >>>> --
> >>>> 2.34.1
> >>>
> >
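P.S. As I read it, the schedule-side check described above ("checks the state under
lock ... moves it to busy and then schedules it") follows roughly the pattern
sketched below. This is only an illustration of the pattern; the type, field, and
function names are made up and are not the actual rxe_task code:

=====
/* Illustration of the pattern only: check the task state under a lock,
 * mark it busy, and only then queue the work item, so at most one
 * instance of a given task runs at any time.  All names here are
 * placeholders, not the real rxe_task fields.
 */
enum { SKETCH_IDLE, SKETCH_BUSY };

struct sketch_task {
	spinlock_t		lock;
	int			state;		/* SKETCH_IDLE or SKETCH_BUSY */
	struct work_struct	work;
};

static void sketch_task_schedule(struct workqueue_struct *wq,
				 struct sketch_task *task)
{
	unsigned long flags;
	bool do_queue = false;

	spin_lock_irqsave(&task->lock, flags);
	if (task->state == SKETCH_IDLE) {
		task->state = SKETCH_BUSY;	/* claim the task ... */
		do_queue = true;
	}
	spin_unlock_irqrestore(&task->lock, flags);

	if (do_queue)
		queue_work(wq, &task->work);	/* ... then schedule it */
}
=====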