On Sun, Nov 6, 2022 6:15 AM Bob Pearson wrote:
> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> > On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> >> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> >>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >>>> This patch series implements work queues as an alternative to
> >>>> the main tasklets in the rdma_rxe driver. The patch series starts
> >>>> with a patch that makes the internal API for task execution pluggable
> >>>> and implements an inline and a tasklet based set of functions.
> >>>> The remaining patches clean up the qp reset and error code in the
> >>>> three tasklets and modify the locking logic to prevent making
> >>>> multiple calls to the tasklet scheduling routine. After
> >>>> this preparation the work queue equivalent set of functions is
> >>>> added and the tasklet version is dropped.
> >>>
> >>> Thank you for posting the 3rd series.
> >>> It looks fine at a glance, but now I am concerned about problems
> >>> that could potentially be caused by concurrency.
> >>>
> >>>>
> >>>> The advantage of the work queue version of deferred task execution
> >>>> is mainly that the work queue variant has much better scalability
> >>>> and overall performance than the tasklet variant. The perftest
> >>>> microbenchmarks in local loopback mode (not a very realistic test
> >>>> case) can reach approximately 100Gb/sec with work queues compared to
> >>>> about 16Gb/sec for tasklets.
> >>>
> >>> As you wrote, the advantage of the work queue version is that the number
> >>> of works that can run in parallel scales with the number of logical CPUs.
> >>> However, the dispatched works (rxe_requester, rxe_responder, and
> >>> rxe_completer) are designed for serial execution on tasklets, so we must
> >>> not rely on them functioning properly under parallel execution.
> >>
> >> Work queues are serial for each separate work task just like tasklets. There isn't
> >> a problem here. The tasklets for different tasks can run in parallel but tend to
> >> do so less than work queue tasks. The reason is that tasklets are scheduled by
> >> default on the same cpu as the thread that scheduled them, while work queue items
> >> are scheduled by the kernel scheduler and get spread around.
> >
> > =====
> > rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> > =====
> > You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> > the system scheduler, but each work is still enqueued to the per-CPU worker
> > pools and is thus bound to the CPU the issuer is running on. It seems the
> > behaviour you expect can be achieved with the WQ_UNBOUND flag. Unbound work
> > items will run on any CPU at the cost of cache utilization.
> >
> > Two instances of the same tasklet never run concurrently on two different
> > processors by nature, but that is not the case with work queues. If two softirqs
> > running on different CPUs enqueue responder works at almost the same time, it is
> > possible that they are dispatched and run on different CPUs at the same time.
> > I mean that problems may arise in such a situation.
> >
> > Please let me know if I missed anything. I referred to the following document.
> > The easiest solution is to use @flags=WQ_UNBOUND and @max_active=1 to let works
> > run serially.
> > cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
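For reference, the single-threaded fallback suggested above would look roughly like
this; just an untested sketch of the alloc_workqueue() call, not a proposed patch:

=====
/* untested sketch: a single unbound worker (max_active = 1), so all rxe
 * work items are executed strictly one at a time
 */
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
=====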
> >
> > Thanks,
> > Daisuke
> >
> According to this:
>
> Workqueue guarantees that a work item cannot be re-entrant if the following
> conditions hold after a work item gets queued:
>
>   The work function hasn't been changed.
>
>   No one queues the work item to another workqueue.
>
>   The work item hasn't been reinitiated.
>
> In other words, if the above conditions hold, the work item is guaranteed to be
> executed by at most one worker system-wide at any given time.
>
> Note that requeuing the work item (to the same queue) in the self function doesn't
> break these conditions, so it's safe to do. Otherwise, caution is required when
> breaking the conditions inside a work function.
>
> It should be OK. Each work item checks the state under lock before scheduling the
> item and, if it is free, moves it to busy and then schedules it. Only one instance
> of a work item at a time should be running.

Thank you for the explanation. Per-qp work items should meet the three conditions.
That is what I was missing. Now I see. You are correct. (For illustration, a rough
sketch of that schedule-side check is appended at the end of this mail.)

>
> I only know what I see from running top. It seems that the work items do get
> spread out over time on the cpus.

It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
workers in the same way. This is not stated explicitly in the document.

>
> The CPU_INTENSIVE flag is certainly correct for our application, which will run
> all the cpus at 100% for extended periods of time. We are benchmarking storage
> with IOR.

It is OK with me. I have not come up with any situation where the CPU_INTENSIVE
flag bothers other rxe users.

Thanks,
Daisuke

>
> Bob
>
> >>>
> >>> There could be 3 problems, which stem from the fact that works are not
> >>> necessarily executed in the same order the packets are received. Works are
> >>> enqueued to worker pools on each CPU, and each CPU schedules its works
> >>> independently, so the ordering of works among CPUs is not guaranteed.
> >>>
> >>> [1]
> >>> On UC/UD connections, the responder does not check the psn of inbound packets,
> >>> so the payloads can be copied to MRs without checking the order. If there are
> >>> works that write to overlapping memory locations, they can potentially cause
> >>> data corruption depending on the order.
> >>>
> >>> [2]
> >>> On RC connections, the responder checks the psn and drops the packet if it is
> >>> not the expected one. The requester can retransmit the request in this case,
> >>> so the order seems to be guaranteed for RC.
> >>>
> >>> However, the responder updates the next expected psn (qp->resp.psn) BEFORE
> >>> replying with an ACK packet. If the work is preempted soon after storing the
> >>> next psn, another work on another CPU can potentially reply with another ACK
> >>> packet earlier. This behaviour is against the spec.
> >>> Cf. IB Specification Vol 1, Release 1.5, "9.5 TRANSACTION ORDERING"
> >>>
> >>> [3]
> >>> Again on RC connections, the next expected psn (qp->resp.psn) can be
> >>> loaded and stored at the same time from different threads. It seems we
> >>> have to use a synchronization method, perhaps the READ_ONCE() and
> >>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> >>> example; there can be other variables that need similar consideration.
> >>>
> >>>
> >>> All the problems above can be solved by making the work queue single-
> >>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> >>> for alloc_workqueue(), but this should be the last resort since it spoils
> >>> the performance benefit of the work queue.
> >>>
> >>> I am not sure what we can do with [1] right now.
> >>> For [2] and [3], we could just move the update of the psn to after the ack
> >>> reply, and use *_ONCE() macros for shared variables.
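For illustration only, the [2]/[3] idea above might look like the sketch below.
Only qp->resp.psn comes from this discussion; the function and the other names are
placeholders and do not reflect the actual rxe_resp.c code:

=====
/* Illustrative sketch: reply with the ACK first, then publish the next
 * expected psn with WRITE_ONCE() so a reader on another CPU never sees
 * the new psn before the corresponding ACK has gone out.
 * send_ack_sketch() and next_psn are placeholders.
 */
static void respond_then_advance_psn(struct rxe_qp *qp, u32 next_psn)
{
	send_ack_sketch(qp);			/* ack reply first */
	WRITE_ONCE(qp->resp.psn, next_psn);	/* then update the psn */
}

/* readers on other CPUs would pair this with READ_ONCE(qp->resp.psn) */
=====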
> >>>
> >>> Thanks,
> >>> Daisuke
> >>>
> >>>>
> >>>> This version of the patch series drops the tasklet version as an option
> >>>> but keeps the option of switching between the workqueue and inline
> >>>> versions.
> >>>>
> >>>> This patch series is derived from an earlier patch set developed by
> >>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >>>> to Lustre servers with hard RoCE v2 NICs.
> >>>>
> >>>> It is based on the current version of wip/jgg-for-next.
> >>>>
> >>>> v3:
> >>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@xxxxxxxxx/
> >>>> The v3 version drops the first few patches which have already been accepted
> >>>> in for-next. It also drops the last patch of the v2 version which
> >>>> introduced module parameters to select between the task interfaces. It also
> >>>> drops the tasklet version entirely. It fixes a minor error caught by
> >>>> the kernel test robot <lkp@xxxxxxxxx> with a missing static declaration.
> >>>>
> >>>> v2:
> >>>> The v2 version of the patch set has some minor changes that address
> >>>> comments from Leon Romanovsky regarding locking of the valid parameter
> >>>> and the setup parameters for alloc_workqueue. It also has one
> >>>> additional cleanup patch.
> >>>>
> >>>> Bob Pearson (13):
> >>>>   RDMA/rxe: Make task interface pluggable
> >>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>>>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>>>   RDMA/rxe: Remove __rxe_do_task()
> >>>>   RDMA/rxe: Make tasks schedule each other
> >>>>   RDMA/rxe: Implement disable/enable_task()
> >>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>>>   RDMA/rxe: Add workqueue support for tasks
> >>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>>>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>>>
> >>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>>>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>>>
> >>>>
> >>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >>>> --
> >>>> 2.34.1
> >>>
> >
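P.S. As I read it, the schedule-side check described above ("checks the state under
lock ... moves it to busy and then schedules it") follows roughly the pattern
sketched below. This is only an illustration of the pattern; the type, field, and
function names are made up and are not the actual rxe_task code:

=====
/* Illustration of the pattern only: check the task state under a lock,
 * mark it busy, and only then queue the work item, so at most one
 * instance of a given task runs at any time.  All names here are
 * placeholders, not the real rxe_task fields.
 */
enum { SKETCH_IDLE, SKETCH_BUSY };

struct sketch_task {
	spinlock_t		lock;
	int			state;		/* SKETCH_IDLE or SKETCH_BUSY */
	struct work_struct	work;
};

static void sketch_task_schedule(struct workqueue_struct *wq,
				 struct sketch_task *task)
{
	unsigned long flags;
	bool do_queue = false;

	spin_lock_irqsave(&task->lock, flags);
	if (task->state == SKETCH_IDLE) {
		task->state = SKETCH_BUSY;	/* claim the task ... */
		do_queue = true;
	}
	spin_unlock_irqrestore(&task->lock, flags);

	if (do_queue)
		queue_work(wq, &task->work);	/* ... then schedule it */
}
=====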