Re: Poll CQ syncing problem

Noa Osherovich <noaos@xxxxxxxxxxxx> · Thu, 2 Mar 2017 08:04:38 +0200

Hi Sagi,

On 3/1/2017 6:44 PM, Sagi Grimberg wrote:

>> Hi Christoph, Sagi
> Hi Noa,
>> Analysis:
>> Since ib_comp_wq isn't single threaded, two works can run in parallel for the same CQ,
>> executing __ib_process_cq.
>>
> How is that even possible? AFAICT a given CQ cannot run more than a
> single work item at a time because we simply queue a work when we get
> a completion event and rearm it only when we fully drain it. We requeue
> if we exhausted our budget but still I don't see a mutual exclusion
> violation...
> Am I missing anything?

As Christoph and Bart pointed out, we use older kernel versions.
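
For reference, the scheme Sagi describes can be modelled roughly as below.
This is only a userspace sketch, not kernel code: the struct, the function
names and the batch size are all made up for illustration. It assumes one
"armed" flag and one "queued" flag per CQ, and shows why at most one work
item should exist per CQ under that scheme.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_POLL_BATCH 16	/* hypothetical number of entries per poll call */

/* Hypothetical stand-in for a CQ; none of these fields exist in the kernel. */
struct model_cq {
	int pending;	/* completions waiting to be polled */
	bool armed;	/* completion notification is armed */
	bool queued;	/* a work item is queued for this CQ */
	int handled;	/* completions processed so far */
};

/* Completion event: fires only while armed, and disarms the CQ. */
static void model_comp_event(struct model_cq *cq)
{
	if (!cq->armed)
		return;
	cq->armed = false;
	cq->queued = true;
}

/* Poll in batches until the budget is used up or the CQ is empty. */
static int model_process_cq(struct model_cq *cq, int budget)
{
	int done = 0;

	while (done < budget && cq->pending > 0) {
		int n = cq->pending < MODEL_POLL_BATCH ?
			cq->pending : MODEL_POLL_BATCH;

		if (n > budget - done)
			n = budget - done;
		cq->pending -= n;
		done += n;
	}
	cq->handled += done;
	return done;
}

/* Work function: requeue if the budget ran out, otherwise rearm. */
static void model_cq_work(struct model_cq *cq, int budget)
{
	int done;

	/* The invariant under discussion: queued and armed never overlap. */
	assert(cq->queued && !cq->armed);
	cq->queued = false;

	done = model_process_cq(cq, budget);
	if (done >= budget)
		cq->queued = true;	/* more may be pending, stay disarmed */
	else
		cq->armed = true;	/* fully drained, wait for next event */
}
```

In this model a second completion event cannot queue a second work item,
because the CQ stays disarmed until the running work has drained it.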

>> Since this function isn't thread safe and the wc array is shared, this causes data corruption,
>> which eventually leads to a crash in the MAD layer due to a double list_del of the same element.
> Hmm. I'm wondering if this is really the root-cause... Can it be the
> fact that ib_comp_wq is unbound causing the worker to migrate cpu cores
> in its lifetime?
>
> I wanted to change that a while ago and sent a patch for it [1].
>
>> We have the following options to solve this:
>> 1. Instead of cq->wc, allocate an ib_wc array in __ib_process_cq on each call.
> That is bad practice.
>
>> 2. Make ib_comp_wq a single thread workqueue.
> Not going to happen, it'll kill performance.
>> 3. Change the locking scheme during poll: Currently only the device's poll_cq implementation
>>    is done under lock. Change it to also contain the callbacks.
> I don't see a need for this at all.

Thanks for your input.
