On 9/16/19 7:17 AM, Danil Kipnis wrote:
On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche <bvanassche@xxxxxxx> wrote:
On 6/20/19 8:03 AM, Jack Wang wrote:
+/*
+ * This is for closing devices when unloading the module:
+ * we might be closing a lot (>256) of devices in parallel
+ * and it is better not to use the system_wq.
+ */
+static struct workqueue_struct *unload_wq;
I think that a better motivation is needed for the introduction of a new
workqueue.
We didn't want to flood the system workqueue when unmapping a large
number of devices at once in parallel. We will rework the comment to
motivate this better.
There are multiple system workqueues. From <linux/workqueue.h>:
extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_highpri_wq;
extern struct workqueue_struct *system_long_wq;
extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;
Has it been considered to use e.g. system_long_wq?
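For example, queuing each device close on system_long_wq would be a
one-line change at the call site. A minimal sketch (the close_work
member and the helper name are illustrative, not taken from the patch):

static void ibnbd_clt_close_work(struct work_struct *work)
{
        struct ibnbd_clt_dev *dev =
                container_of(work, struct ibnbd_clt_dev, close_work);

        /* ... tear down this one device ... */
}

static void schedule_device_close(struct ibnbd_clt_dev *dev)
{
        INIT_WORK(&dev->close_work, ibnbd_clt_close_work);
        /* system_long_wq exists for work items that may run long */
        queue_work(system_long_wq, &dev->close_work);
}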
A more general question: why does ibnbd need its own queue management
while no other block driver needs this?
Each IBNBD device promises to have a queue_depth (of, say, 512) on each
of its num_cpus hardware queues. In fact we can only process
queue_depth in-flight requests at once on the whole ibtrs session
connecting a given client with a given server. Those 512 in-flight
requests (corresponding to the number of buffers reserved by the server
for this particular client) have to be shared among all the devices
mapped on this session. This leads to the situation that we receive
more requests than we can process at the moment, so we need to stop
queues and start them again later in some fair fashion.
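In blk-mq terms, the pattern is roughly the following (a sketch with
assumed names such as sess->busy and sess->queue_depth; not the actual
IBNBD code):

static blk_status_t ibnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct ibnbd_clt_session *sess = hctx->queue->queuedata;

        if (!atomic_add_unless(&sess->busy, 1, sess->queue_depth)) {
                /* Shared session budget exhausted: park this queue. */
                blk_mq_stop_hw_queue(hctx);
                return BLK_STS_RESOURCE;
        }
        /* ... hand bd->rq over to ibtrs_clt_request() ... */
        return BLK_STS_OK;
}

static void ibnbd_complete_one(struct ibnbd_clt_session *sess,
                               struct request *rq)
{
        struct request_queue *q = rq->q;

        atomic_dec(&sess->busy);
        blk_mq_end_request(rq, BLK_STS_OK);
        /* A slot was freed: let stopped queues make progress again. */
        blk_mq_start_stopped_hw_queues(q, true);
}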
Can a single CPU really sustain a queue depth of 512 commands? Is it
really necessary to have one hardware queue per CPU, or would e.g. four
queues per NUMA node be sufficient? Has it been considered to send the
number of hardware queues the initiator wants to use, and also the
command depth per queue, to the target side during login? That would
allow the target side to allocate an independent set of buffers for
each initiator hardware queue and would make it possible to remove the
queue management at the initiator side. It might even yield better
performance.
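To make that concrete: with the depth negotiated at login, the
initiator could express both numbers directly in its tag set and let
blk-mq do the throttling. A sketch, assuming a negotiated_depth field
and the driver's blk_mq_ops (both names hypothetical here):

static int ibnbd_init_tag_set(struct ibnbd_clt_session *sess)
{
        struct blk_mq_tag_set *set = &sess->tag_set;

        memset(set, 0, sizeof(*set));
        set->ops           = &ibnbd_mq_ops;
        /* e.g. four queues per NUMA node instead of one per CPU */
        set->nr_hw_queues  = 4 * num_online_nodes();
        /* depth negotiated with the server during login */
        set->queue_depth   = sess->negotiated_depth;
        set->numa_node     = NUMA_NO_NODE;
        set->flags         = BLK_MQ_F_SHOULD_MERGE;

        return blk_mq_alloc_tag_set(set);
}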
+static void msg_conf(void *priv, int errno)
+{
+ struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
The kernel code I'm familiar with does not cast void pointers explicitly
into another type. Please follow that convention and leave the cast out
from the above and also from similar statements.
msg_conf() is a callback which IBNBD passes down to IBTRS together with
a request when calling ibtrs_clt_request(). When the request completes,
msg_conf() is called with a pointer to a struct defined in IBNBD. So
IBTRS, as the transport, doesn't know what is behind the private
pointer IBNBD passed down with the request; it is opaque, since struct
ibnbd_iu is not visible in IBTRS. I will try to find out how others
avoid a cast in similar situations.
Are you aware that the C language converts a void pointer into a
non-void pointer implicitly, that is, without an explicit cast?
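That is, the handler can simply assign the pointer:

static void msg_conf(void *priv, int errno)
{
        struct ibnbd_iu *iu = priv;     /* implicit conversion, no cast */

        /* ... complete the I/O unit ... */
}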
+static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+ DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
+
+ prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
+ if (IS_ERR_OR_NULL(sess->ibtrs)) {
+ finish_wait(&sess->ibtrs_waitq, &wait);
+ return;
+ }
+ mutex_unlock(&sess_lock);
+ /* After unlock session can be freed, so careful */
+ schedule();
+ mutex_lock(&sess_lock);
+}
This doesn't look right: any random wake_up() call can wake up this
function. Shouldn't there be a loop in this function that causes the
schedule() call to be repeated until the disconnect has happened?
The loop is inside __find_and_get_sess(), which calls this function. We
need to schedule() here so that another thread can remove the dying
session we just found and tried to get a reference to from the list of
sessions, after which we can go over the list again in
__find_and_get_sess().
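Roughly, the caller looks like this (a sketch based on the description
above; the sess_list and ibnbd_clt_get_sess() names are assumptions):

static struct ibnbd_clt_session *__find_and_get_sess(const char *sessname)
{
        struct ibnbd_clt_session *sess;

again:
        list_for_each_entry(sess, &sess_list, list) {
                if (strcmp(sessname, sess->sessname))
                        continue;
                if (ibnbd_clt_get_sess(sess))
                        return sess;    /* took a reference, done */
                /*
                 * The session is dying; drop sess_lock so another
                 * thread can remove it from the list, then rescan
                 * the whole list from the start.
                 */
                wait_for_ibtrs_disconnection(sess);
                goto again;
        }
        return NULL;
}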
Thanks for the clarification.
Bart.