Re: Poll CQ syncing problem

Sagi Grimberg <sagi@xxxxxxxxxxx> · Wed, 1 Mar 2017 18:44:58 +0200

Hi Christoph, Sagi

Hi Noa,

Analysis:
Since ib_comp_wq isn't single threaded, two works can run in parallel for the same CQ,
executing __ib_process_cq.

How is that even possible? AFAICT a given CQ cannot run more than a
single work item at a time because we simply queue a work when we get
a completion event and rearm it only when we fully drain it. We requeue
if we exhausted our budget, but I still don't see a mutual exclusion
violation...

Am I missing anything?
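
To make the argument above concrete, here is a minimal single-threaded
userspace sketch of the queue / drain / rearm cycle as I understand it.
All names (toy_cq, toy_poll_work, TOY_POLL_BUDGET, ...) are hypothetical
and are not the kernel's; the budget value is arbitrary. It only models
the control flow described above, so it cannot demonstrate or rule out
a real race:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_POLL_BUDGET 4   /* hypothetical per-invocation poll budget */

/* Toy stand-in for a CQ; NOT the kernel's struct ib_cq. */
struct toy_cq {
    int  pending;       /* completions waiting to be polled */
    bool armed;         /* next completion raises an event if true */
    bool work_queued;   /* poll work currently sits on the workqueue */
    int  times_queued;  /* how often the work was actually queued */
};

/* Queueing is a no-op if the work is already queued. */
static bool toy_queue_work(struct toy_cq *cq)
{
    if (cq->work_queued)
        return false;
    cq->work_queued = true;
    cq->times_queued++;
    return true;
}

/* A completion arrives: only an armed CQ raises an event, and the
 * event disarms the CQ before the work is queued. */
static void toy_hw_completion(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        toy_queue_work(cq);
    }
}

/* Poll up to 'budget' completions; returns how many were consumed. */
static int toy_process_cq(struct toy_cq *cq, int budget)
{
    int done = 0;

    assert(budget > 0);
    while (cq->pending > 0 && done < budget) {
        cq->pending--;
        done++;
    }
    return done;
}

/* Rearm the CQ; returns true if completions may have been missed. */
static bool toy_rearm(struct toy_cq *cq)
{
    cq->armed = true;
    return cq->pending > 0;
}

/* One work invocation: requeue if the budget was exhausted, otherwise
 * rearm and requeue only if something slipped in before the rearm. */
static void toy_poll_work(struct toy_cq *cq)
{
    int completed;

    cq->work_queued = false;
    completed = toy_process_cq(cq, TOY_POLL_BUDGET);
    if (completed >= TOY_POLL_BUDGET || toy_rearm(cq))
        toy_queue_work(cq);
}

/* Run the work until nothing is queued; returns the invocation count. */
static int toy_run_until_idle(struct toy_cq *cq)
{
    int runs = 0;

    while (cq->work_queued) {
        toy_poll_work(cq);
        runs++;
    }
    return runs;
}
```

In this model, ten completions hitting an armed, idle CQ queue the work
once; the work then requeues itself while the budget is exhausted and
rearms only on the invocation that drains the CQ.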

Since this function isn't thread safe and the wc array is shared, it causes a data corruption
which eventually crashes in the MAD layer due to a double list_del of the same element.

Hmm. I'm wondering if this is really the root-cause... Can it be the
fact that ib_comp_wq is unbound causing the worker to migrate cpu cores
in its lifetime?

I wanted to change that a while ago and sent a patch for it [1].

We have the following options to solve this:
1. Instead of cq->wc, allocate an ib_wc array in __ib_process_cq per each call.

That is bad practice.

2. Make ib_comp_wq a single thread workqueue.

Not going to happen, it'll kill performance.

3. Change the locking scheme during poll: Currently only the device's poll_cq implementation
   is done under lock. Change it to also contain the callbacks.

I don't see a need for this at all.