Hi Christoph, Sagi
Hi Noa,
Analysis: Since ib_comp_wq isn't single threaded, two works can run in parallel for the same CQ, executing __ib_process_cq.
How is that even possible? AFAICT a given CQ cannot have more than a single work item running at a time, because we simply queue a work when we get a completion event and rearm the CQ only when we have fully drained it. We requeue if we exhausted our budget, but I still don't see a mutual exclusion violation... Am I missing anything?
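To make the argument concrete, here is a rough userspace model of the flow as I understand it. This is only a sketch: every name in it (model_cq, model_completion, model_work, model_run) is made up for illustration and is not a kernel API. It is also purely sequential, so it does not model a completion arriving between the poll and the rearm.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a single CQ; not the kernel's struct ib_cq. */
struct model_cq {
	int pending;       /* completions waiting to be polled */
	bool armed;        /* next completion will raise an event */
	int queued_works;  /* work items currently queued for this CQ */
	int max_queued;    /* high-water mark of queued_works */
};

static void model_queue_work(struct model_cq *cq)
{
	cq->queued_works++;
	if (cq->queued_works > cq->max_queued)
		cq->max_queued = cq->queued_works;
}

/* A completion arrives; an event (and a work) only if the CQ is armed. */
static void model_completion(struct model_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = false;      /* arming is one-shot */
		model_queue_work(cq);
	}
}

/* One run of the work item: poll up to budget, then requeue or rearm. */
static void model_work(struct model_cq *cq, int budget)
{
	int done = cq->pending < budget ? cq->pending : budget;

	assert(cq->queued_works == 1);  /* the invariant argued above */
	cq->queued_works--;
	cq->pending -= done;

	if (done >= budget)
		model_queue_work(cq);   /* budget exhausted: requeue, stay unarmed */
	else
		cq->armed = true;       /* fully drained: rearm */
}

/* Returns the largest number of simultaneously queued works, or -1. */
static int model_run(int completions, int budget)
{
	struct model_cq cq = { 0, true, 0, 0 };
	int i;

	if (budget <= 0)
		return -1;              /* a non-positive budget never drains */

	for (i = 0; i < completions; i++)
		model_completion(&cq);
	while (cq.queued_works > 0)
		model_work(&cq, budget);

	return cq.max_queued;
}
```

Calling model_run() with any positive budget should never report more than one queued work for the CQ, which is why I don't see how two works could be processing the same CQ in parallel under this scheme.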
Since this function isn't thread safe and the wc array is shared, it causes a data corruption which eventually crashes in the MAD layer due to a double list_del of the same element.
Hmm. I'm wondering if this is really the root cause... Could it be that ib_comp_wq is unbound, which lets the worker migrate between CPU cores during its lifetime? I wanted to change that a while ago and sent a patch for it [1].
We have the following options to solve this: 1. Instead of cq->wc, allocate an ib_wc array in __ib_process_cq per each call.
That is bad practice.
2. Make ib_comp_wq a single thread workqueue.
Not going to happen, it'll kill performance.
3. Change the locking scheme during poll: Currently only the device's poll_cq implementation is done under lock. Change it to also contain the callbacks.
I don't see a need for this at all.