Hi Sagi,

On 3/1/2017 6:44 PM, Sagi Grimberg wrote:
>> Hi Christoph, Sagi
> Hi Noa,
>> Analysis:
>> Since ib_comp_wq isn't single threaded, two works can run in parallel for the same CQ,
>> executing __ib_process_cq.
>>
> How is that even possible? AFAICT a given CQ cannot run more than a
> single work item at a time because we simply queue a work when we get
> a completion event and rearm it only when we fully drain it. We requeue
> if we exhausted our budget but still I don't see a mutual exclusion
> violation...
> Am I missing anything?

As Christoph and Bart pointed out, we are using an older kernel version.

>> Since this function isn't thread safe and the wc array is shared, it causes a data corruption
>> which eventually crashes in the MAD layer due to a double list_del of the same element.
> Hmm. I'm wondering if this is really the root-cause... Can it be the
> fact that ib_comp_wq is unbound causing the worker to migrate cpu cores
> in its lifetime?
>
> I wanted to change that a while ago and sent a patch for it [1].
>
>> We have the following options to solve this:
>> 1. Instead of cq->wc, allocate an ib_wc array in __ib_process_cq per each call.
> That is bad practice.
>
>> 2. Make ib_comp_wq a single thread workqueue.
> Not going to happen, it'll kill performance.
>> 3. Change the locking scheme during poll: Currently only the device's poll_cq implementation
>> is done under lock. Change it to also contain the callbacks.
> I don't see a need for this at all.

Thanks for your input.