On 11/7/22 02:21, Daisuke Matsuda (Fujitsu) wrote:
> On Sun, Nov 6, 2022 6:15 AM Bob Pearson wrote:
>> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
>>> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>>>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>>>> This patch series implements work queues as an alternative for
>>>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>>>> with a patch that makes the internal API for task execution pluggable
>>>>>> and implements an inline and a tasklet based set of functions.
>>>>>> The remaining patches clean up the qp reset and error code in the
>>>>>> three tasklets and modify the locking logic to prevent making
>>>>>> multiple calls to the tasklet scheduling routine. After
>>>>>> this preparation the work queue equivalent set of functions is
>>>>>> added and the tasklet version is dropped.
>>>>>
>>>>> Thank you for posting the 3rd series.
>>>>> It looks fine at a glance, but now I am concerned about problems
>>>>> that could potentially be caused by concurrency.
>>>>>
>>>>>>
>>>>>> The main advantage of the work queue version of deferred task execution
>>>>>> is that it has much better scalability
>>>>>> and overall performance than the tasklet variant. The perftest
>>>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>>>> about 16Gb/sec for tasklets.
>>>>>
>>>>> As you wrote, the advantage of the work queue version is that the number of works
>>>>> that can run in parallel scales with the number of logical CPUs. However, the
>>>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) were
>>>>> designed for serial execution on tasklets, so we must not rely on them functioning
>>>>> properly when executed in parallel.
>>>>
>>>> Work queues are serial for each separate work task just like tasklets. There isn't
>>>> a problem here.
>>>> The tasklets for different tasks can run in parallel but tend to
>>>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>>>> default on the same cpu as the thread that scheduled them, while work queue tasks
>>>> are scheduled by the kernel scheduler and get spread around.
>>>
>>> =====
>>> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
>>> =====
>>> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
>>> the system scheduler, but each work is still enqueued to the worker pool of a single CPU
>>> and thus bound to the CPU the issuer is running on. It seems the behaviour you
>>> expect can be achieved with the WQ_UNBOUND flag. Unbound work items will run
>>> on any CPU at the cost of cache utilization.
>>>
>>> Two instances of the same tasklet never run concurrently on two different processors
>>> by nature, but that is not the case with work queues. If two softirqs running on
>>> different CPUs enqueue responder works at almost the same time, it is possible that
>>> they are dispatched and run on different CPUs at the same time. I mean that problems
>>> may arise in such a situation.
>>>
>>> Please let me know if I missed anything. I referred to the following document.
>>> The easiest solution is to use @flags=WQ_UNBOUND and @max_active=1 to let works
>>> run serially.
>>> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
>>>
>>> Thanks,
>>> Daisuke
>>>
>> According to this document:
>>
>> Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>> after a work item gets queued:
>>
>> The work function hasn't been changed.
>>
>> No one queues the work item to another workqueue.
>>
>> The work item hasn't been reinitiated.
>>
>> In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>> most one worker system-wide at any given time.
>>
>> Note that requeuing the work item (to the same queue) in the self function doesn't break these
>> conditions, so it's safe to do. Otherwise, caution is required when breaking the conditions
>> inside a work function.
>>
>> So it should be OK. Each work item checks the state under lock before scheduling the item and,
>> if it is free, moves it to busy and then schedules it. Only one instance of a work item
>> at a time should be running.
>
> Thank you for the explanation.
> Per-qp work items should meet the three conditions. That is what I was missing.
> Now I see. You are correct.
>
>>
>> I only know what I see from running top. It seems that the work items do get spread out over
>> time across the cpus.
>
> It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
> workers in the same way. This is not stated explicitly in the document.
>
>>
>> The CPU_INTENSIVE flag is certainly correct for our application, which will run all the cpus
>> at 100% for extended periods of time. We are benchmarking storage with IOR.
>
> It is OK with me. I have not come up with any situation where the CPU_INTENSIVE
> flag bothers other rxe users.
>
> Thanks,
> Daisuke
>
>>
>> Bob
>>
>>>>>
>>>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>>>> executed in the same order the packets are received. Works are enqueued to worker
>>>>> pools on each CPU, and each CPU schedules its works independently, so the ordering
>>>>> of works among CPUs is not guaranteed.
>>>>>
>>>>> [1]
>>>>> On UC/UD connections, the responder does not check the psn of inbound packets,
>>>>> so the payloads can be copied to MRs without checking the order. If there are
>>>>> works that write to overlapping memory locations, they can potentially cause
>>>>> data corruption depending on the order.
>>>>>
>>>>> [2]
>>>>> On RC connections, the responder checks the psn, and drops the packet if it is not
>>>>> the expected one.
>>>>> The requester can retransmit the request in this case, so the order
>>>>> seems to be guaranteed for RC.
>>>>>
>>>>> However, the responder updates the next expected psn (qp->resp.psn) BEFORE
>>>>> replying with an ACK packet. If the work is preempted soon after storing the next psn,
>>>>> another work on another CPU can potentially reply with another ACK packet earlier.
>>>>> This behaviour is against the spec.
>>>>> Cf. IB Specification Vol 1-Release-1.5 "9.5 TRANSACTION ORDERING"
>>>>>
>>>>> [3]
>>>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>>>> loaded and stored at the same time from different threads. It seems we
>>>>> have to use a synchronization method, perhaps the READ_ONCE() and
>>>>> WRITE_ONCE() macros, to prevent loading a stale value. This one is just an
>>>>> example; there can be other variables that need similar consideration.
>>>>>
>>>>>
>>>>> All the problems above can be solved by making the work queue single-
>>>>> threaded. We can do this with flags=WQ_UNBOUND and max_active=1
>>>>> for alloc_workqueue(), but this should be the last resort since it spoils
>>>>> the performance benefit of work queues.
>>>>>
>>>>> I am not sure what we can do about [1] right now.
>>>>> For [2] and [3], we could just move the update of the psn to after the ack reply,
>>>>> and use the *_ONCE() macros for shared variables.
>>>>>
>>>>> Thanks,
>>>>> Daisuke

Thank you for taking the time to review this.

Bob

>>>>>
>>>>>>
>>>>>> This version of the patch series drops the tasklet version as an option
>>>>>> but keeps the option of switching between the workqueue and inline
>>>>>> versions.
>>>>>>
>>>>>> This patch series is derived from an earlier patch set developed by
>>>>>> Ian Ziemba at HPE, which is used in some Lustre storage clients attached
>>>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>>>
>>>>>> It is based on the current version of wip/jgg-for-next.
>>>>>>
>>>>>> v3:
>>>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@xxxxxxxxx/
>>>>>> The v3 version drops the first few patches, which have already been accepted
>>>>>> in for-next. It also drops the last patch of the v2 version, which
>>>>>> introduced module parameters to select between the task interfaces, and it
>>>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>>>> the kernel test robot <lkp@xxxxxxxxx>: a missing static declaration.
>>>>>>
>>>>>> v2:
>>>>>> The v2 version of the patch set has some minor changes that address
>>>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>>>> and the setup parameters for alloc_workqueue. It also has one
>>>>>> additional cleanup patch.
>>>>>>
>>>>>> Bob Pearson (13):
>>>>>>   RDMA/rxe: Make task interface pluggable
>>>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>>>   RDMA/rxe: Make tasks schedule each other
>>>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>>>
>>>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>>>> --
>>>>>> 2.34.1
>>>>>
>>>
>