Re: [PATCH v9 06/25] RDMA/rtrs: client: main functionality

On 3/2/20 5:20 AM, Danil Kipnis wrote:
On Sun, Mar 1, 2020 at 2:33 AM Bart Van Assche <bvanassche@xxxxxxx> wrote:
On 2020-02-21 02:47, Jack Wang wrote:
+static struct rtrs_permit *
+__rtrs_get_permit(struct rtrs_clt *clt, enum rtrs_clt_con_type con_type)
+{
+     size_t max_depth = clt->queue_depth;
+     struct rtrs_permit *permit;
+     int cpu, bit;
+
+     /* Combined with cq_vector, we pin the IO to the the cpu it comes */

This comment is confusing; please clarify it. All I see below is that
preemption is disabled. I don't see any pinning of I/O to the CPU of
the caller.
The comment is addressing a use case of the driver: the user can
assign (under /proc/irq/) the IRQs of the HCA cq_vectors "one-to-one"
to each CPU. This "forces" the driver to process an I/O response on
the same CPU the I/O was submitted on.
In the code below only preemption is disabled. This can lead to the
situation that callers on different CPUs pick the same bit, since
find_first_zero_bit() is not atomic. But then test_and_set_bit_lock()
will fail for all callers but one, so the losers simply loop again.
This way an explicit spinlock is not required.
Will extend the comment.
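
(For illustration, a minimal sketch of the claim loop described above;
permit_claim_sketch() and rtrs_permit_from_bit() are made-up names, not
the actual driver helpers, and the usual bitops/smp includes are assumed.)

static struct rtrs_permit *permit_claim_sketch(unsigned long *map,
					       size_t max_depth)
{
	unsigned long bit;

	get_cpu();		/* only disables preemption, as noted above */
	do {
		bit = find_first_zero_bit(map, max_depth);
		if (bit >= max_depth) {
			put_cpu();
			return NULL;	/* all permits currently in use */
		}
	} while (test_and_set_bit_lock(bit, map));
	put_cpu();

	return rtrs_permit_from_bit(bit);	/* made-up lookup helper */
}

Losing callers simply retry find_first_zero_bit(), so contention is
resolved without a lock, at the cost of a retry loop.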

If the purpose of the get_cpu() and put_cpu() calls is to serialize code against other threads, please use locking instead of disabling preemption. That will help tools that verify locking, such as lockdep and the kernel thread sanitizer (https://github.com/google/ktsan/wiki).
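
(For comparison, a sketch of a lock-based variant along the lines
suggested above; the lock placement and names are illustrative only,
not a proposal for the actual driver code.)

static int permit_bit_get_locked(spinlock_t *lock, unsigned long *map,
				 size_t max_depth)
{
	unsigned long bit;
	int ret = -1;			/* -1: no free permit */

	spin_lock(lock);
	bit = find_first_zero_bit(map, max_depth);
	if (bit < max_depth) {
		__set_bit(bit, map);	/* non-atomic is fine under the lock */
		ret = bit;
	}
	spin_unlock(lock);

	return ret;
}

The explicit lock makes the find/set pair atomic and is visible to
lockdep and KTSAN.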

+static int rtrs_post_send_rdma(struct rtrs_clt_con *con,
+                             struct rtrs_clt_io_req *req,
+                             struct rtrs_rbuf *rbuf, u32 off,
+                             u32 imm, struct ib_send_wr *wr)
+{
+     struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+     enum ib_send_flags flags;
+     struct ib_sge sge;
+
+     if (unlikely(!req->sg_size)) {
+             rtrs_wrn(con->c.sess,
+                      "Doing RDMA Write failed, no data supplied\n");
+             return -EINVAL;
+     }
+
+     /* user data and user message in the first list element */
+     sge.addr   = req->iu->dma_addr;
+     sge.length = req->sg_size;
+     sge.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+     /*
+      * From time to time we have to post signalled sends,
+      * or send queue will fill up and only QP reset can help.
+      */
+     flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+                     0 : IB_SEND_SIGNALED;
+
+     ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+                                   req->sg_size, DMA_TO_DEVICE);
+
+     return rtrs_iu_post_rdma_write_imm(&con->c, req->iu, &sge, 1,
+                                         rbuf->rkey, rbuf->addr + off,
+                                         imm, flags, wr);
+}

I don't think that posting a signalled send from time to time is
sufficient to prevent send queue overflow. Please address Jason's
comment from January 7th: "Not quite. If the SQ depth is 16 and you post
16 things and then signal the last one, you *cannot* post new work until
you see the completion. More SQ space *ONLY* becomes available upon
receipt of a completion. This is why you can't have an unsignaled SQ."

See also https://lore.kernel.org/linux-rdma/20200107182528.GB26174@xxxxxxxx/
In our case we set the send queue size of each QP belonging to one
"session" to the maximum supported by the hardware (max_qp_wr), which
is around 5K on our hardware. The queue depth of our "session" is 512.
Those 512 permits are "shared" by all the QPs belonging to that session
(one QP per CPU on the client side), so we have at most 512 in-flight
requests on a single QP and 512/num_cpus on average. We have never
experienced a send-queue-full event in any of our performance tests or
in production usage. The alternative would be to count submitted and
completed requests, check the difference before each submission, and
wait if the difference multiplied by the queue depth of the "session"
exceeds the maximum supported by the hardware. The check would require
quite some code and would most probably affect performance. I do not
think it is worth introducing a code path which is triggered only on a
condition that is known to never become true.
Jason, do you think it's necessary to implement such tracking?
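
(For reference, a minimal sketch of such per-QP tracking; the structure
and names are made up. Note also that with periodic signalling the
completion handler would have to release a whole batch of WRs rather
than a single one.)

struct sq_budget {
	atomic_t inflight;	/* WRs posted but not yet completed */
	int	 sq_depth;	/* max_send_wr the QP was created with */
};

static bool sq_budget_get(struct sq_budget *b)
{
	if (atomic_inc_return(&b->inflight) > b->sq_depth) {
		atomic_dec(&b->inflight);
		return false;		/* caller must wait or retry */
	}
	return true;
}

/* called from the (signalled) send completion handler */
static void sq_budget_put(struct sq_budget *b)
{
	atomic_dec(&b->inflight);
}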

Please either make sure that the send queues cannot overflow by providing enough space for 512 in-flight requests, or implement tracking of the number of in-flight requests.

Thanks,

Bart.



