On Tue, Mar 3, 2020 at 5:04 PM Bart Van Assche <bvanassche@xxxxxxx> wrote:
>
> On 3/2/20 5:20 AM, Danil Kipnis wrote:
> > On Sun, Mar 1, 2020 at 2:33 AM Bart Van Assche <bvanassche@xxxxxxx> wrote:
> >> On 2020-02-21 02:47, Jack Wang wrote:
> >>> +static struct rtrs_permit *
> >>> +__rtrs_get_permit(struct rtrs_clt *clt, enum rtrs_clt_con_type con_type)
> >>> +{
> >>> +        size_t max_depth = clt->queue_depth;
> >>> +        struct rtrs_permit *permit;
> >>> +        int cpu, bit;
> >>> +
> >>> +        /* Combined with cq_vector, we pin the IO to the the cpu it comes */
> >>
> >> This comment is confusing. Please clarify this comment. All I see below
> >> is that preemption is disabled. I don't see pinning of I/O to the CPU of
> >> the caller.
> >
> > The comment is addressing a use-case of the driver: The user can
> > assign (under /proc/irq/) the irqs of the HCA cq_vectors "one-to-one"
> > to each cpu. This will "force" the driver to process io response on
> > the same cpu the io has been submitted on.
> > In the code below only preemption is disabled. This can lead to the
> > situation that callers from different cpus will grab the same bit,
> > since find_first_zero_bit is not atomic. But then the
> > test_and_set_bit_lock will fail for all the callers but one, so that
> > they will loop again. This way an explicit spinlock is not required.
> > Will extend the comment.
>
> If the purpose of get_cpu() and put_cpu() calls is to serialize code
> against other threads, please use locking instead of disabling
> preemption. This will help tools that verify locking like lockdep and
> the kernel thread sanitizer (https://github.com/google/ktsan/wiki).

We can look into it, but I'm afraid converting to a spinlock might have
a performance impact.

>
> >>> +static int rtrs_post_send_rdma(struct rtrs_clt_con *con,
> >>> +                               struct rtrs_clt_io_req *req,
> >>> +                               struct rtrs_rbuf *rbuf, u32 off,
> >>> +                               u32 imm, struct ib_send_wr *wr)
> >>> +{
> >>> +        struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> >>> +        enum ib_send_flags flags;
> >>> +        struct ib_sge sge;
> >>> +
> >>> +        if (unlikely(!req->sg_size)) {
> >>> +                rtrs_wrn(con->c.sess,
> >>> +                         "Doing RDMA Write failed, no data supplied\n");
> >>> +                return -EINVAL;
> >>> +        }
> >>> +
> >>> +        /* user data and user message in the first list element */
> >>> +        sge.addr = req->iu->dma_addr;
> >>> +        sge.length = req->sg_size;
> >>> +        sge.lkey = sess->s.dev->ib_pd->local_dma_lkey;
> >>> +
> >>> +        /*
> >>> +         * From time to time we have to post signalled sends,
> >>> +         * or send queue will fill up and only QP reset can help.
> >>> +         */
> >>> +        flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
> >>> +                0 : IB_SEND_SIGNALED;
> >>> +
> >>> +        ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
> >>> +                                      req->sg_size, DMA_TO_DEVICE);
> >>> +
> >>> +        return rtrs_iu_post_rdma_write_imm(&con->c, req->iu, &sge, 1,
> >>> +                                           rbuf->rkey, rbuf->addr + off,
> >>> +                                           imm, flags, wr);
> >>> +}
> >>
> >> I don't think that posting a signalled send from time to time is
> >> sufficient to prevent send queue overflow. Please address Jason's
> >> comment from January 7th: "Not quite. If the SQ depth is 16 and you post
> >> 16 things and then signal the last one, you *cannot* post new work until
> >> you see the completion. More SQ space *ONLY* becomes available upon
> >> receipt of a completion. This is why you can't have an unsignaled SQ."
> >> See also https://lore.kernel.org/linux-rdma/20200107182528.GB26174@xxxxxxxx/
> >
> > In our case we set the send queue of each QP belonging to one
> > "session" to the one supported by the hardware (max_qp_wr) which is
> > around 5K on our hardware. The queue depth of our "session" is 512.
> > Those 512 are "shared" by all the QPs (number of CPUs on client side)
> > belonging to that session. So we have at most 512 and 512/num_cpus on
> > average inflights on each QP. We never experienced send queue full
> > event in any of our performance tests or production usage. The
> > alternative would be to count submitted requests and completed
> > requests, check the difference before submission and wait if the
> > difference multiplied by the queue depth of "session" exceeds the max
> > supported by the hardware. The check will require quite some code and
> > will most probably affect performance. I do not think it is worth it
> > to introduce a code path which is triggered only on a condition which
> > is known to never become true.
> > Jason, do you think it's necessary to implement such tracking?
>
> Please either make sure that send queues do not overflow by providing
> enough space for 512 in-flight requests to fit, or implement tracking
> for the number of in-flight requests.

We do have enough space for the send queue.

>
> Thanks,
>
> Bart.

Thanks Bart!
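
For illustration, the lock-free permit acquisition discussed above boils
down to roughly the following pattern (a simplified sketch, not the exact
patch code; the example_* names are made up for this mail):

#include <linux/bitops.h>	/* find_first_zero_bit(), test_and_set_bit_lock() */

/*
 * Several CPUs may observe the same free bit in find_first_zero_bit(),
 * but only one of them wins the atomic test_and_set_bit_lock(); the
 * losers simply retry and pick the next free bit, so no spinlock is
 * needed around the bitmap itself.
 */
static int example_get_slot(unsigned long *map, size_t depth)
{
	int bit;

	do {
		bit = find_first_zero_bit(map, depth);
		if (bit >= depth)
			return -1;	/* all slots busy, caller backs off */
	} while (test_and_set_bit_lock(bit, map));

	return bit;
}

static void example_put_slot(unsigned long *map, int bit)
{
	/* pairs with test_and_set_bit_lock() in example_get_slot() */
	clear_bit_unlock(bit, map);
}

In this reading, the get_cpu()/put_cpu() pair in the patch appears to be
about noting the submitting CPU for the cq_vector affinity use-case
described above, rather than about making the bitmap update safe.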
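
For completeness, here is a rough sketch of the explicit in-flight
accounting described above as the alternative (hypothetical struct and
field names, not part of the posted patch; it also assumes sends are
signalled often enough that completions actually arrive to drive the
decrement):

#include <linux/atomic.h>
#include <linux/types.h>

struct example_con {
	atomic_t sq_inflight;	/* WRs posted but not yet completed */
	int sq_size;		/* max WRs the send queue can hold */
};

static bool example_can_post(struct example_con *con)
{
	/* reserve a slot up front; undo the reservation if the SQ is full */
	if (atomic_inc_return(&con->sq_inflight) > con->sq_size) {
		atomic_dec(&con->sq_inflight);
		return false;	/* caller waits for a send completion */
	}
	return true;
}

static void example_send_done(struct example_con *con)
{
	/* called from the send completion path */
	atomic_dec(&con->sq_inflight);
}

The cost would be an extra atomic per I/O plus a wait path that, with a
512-deep session queue and ~5K-deep send queues, is expected never to be
taken in practice.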