On 10/10/23 09:09, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote:
>> Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
>> I see this is supported by Zhu, Bart and approved by Leon.
>> Solution 2: Serializing execution of work items
>> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>> + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
>> Solution 3: Merging requester and completer (not yet submitted/tested)
>> https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@xxxxxxxxx/
>> Not clear to me if we should call this a new feature or a fix.
>> If it can eliminate the hang issue, it could be an ultimate solution.
>> It is understandable some people do not want to wait for solution 3 to be submitted and verified.
>> Is there any problem if we adopt solution 2?
>> If so, then I agree to going with solution 1.
>> If not, solution 2 is better to me.
> I also do not want to go backwards, I don't believe the locking is
> magically correct under tasklets. 2 is painful enough to continue to
> motivate people to fix this while unbreaking block tests.
In my opinion (2) is not a solution. Zhu Yanjun reported test failures with
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably
made it less likely to trigger any race conditions but I don't believe that
this is sufficient as a solution.
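
To make concrete what a max_active of 1 is meant to buy, here is a minimal
userspace sketch. It is an analogy built on pthreads, not the kernel
workqueue API: struct pool, worker() and peak_concurrency() are hypothetical
names, and the sketch ignores kernel details such as per-CPU or per-node
limits and WQ_HIGHPRI. It only shows that one execution context means at
most one work item runs at a time.

```c
/* Userspace analogy only: this is NOT the kernel workqueue API. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct pool {
	int nr_items;       /* total number of work items to run */
	atomic_int next;    /* index of the next unclaimed item */
	atomic_int running; /* items executing right now */
	atomic_int peak;    /* highest value 'running' ever reached */
};

static void *worker(void *arg)
{
	struct pool *p = arg;

	/* Claim items until none are left. */
	while (atomic_fetch_add(&p->next, 1) < p->nr_items) {
		int now = atomic_fetch_add(&p->running, 1) + 1;
		int old = atomic_load(&p->peak);

		/* Record a new peak if this item raised the concurrency. */
		while (now > old &&
		       !atomic_compare_exchange_weak(&p->peak, &old, now))
			;
		/* A real work item would do its job here. */
		atomic_fetch_sub(&p->running, 1);
	}
	return NULL;
}

/* Run nr_items items on max_active threads; return the peak concurrency. */
int peak_concurrency(int max_active, int nr_items)
{
	struct pool p;
	pthread_t *tid = calloc(max_active, sizeof(*tid));
	int i, rc;

	assert(tid != NULL);
	p.nr_items = nr_items;
	atomic_init(&p.next, 0);
	atomic_init(&p.running, 0);
	atomic_init(&p.peak, 0);

	for (i = 0; i < max_active; i++) {
		rc = pthread_create(&tid[i], NULL, worker, &p);
		assert(rc == 0);
	}
	for (i = 0; i < max_active; i++)
		pthread_join(tid[i], NULL);

	free(tid);
	return atomic_load(&p.peak);
}
```

With one thread, peak_concurrency(1, 8) reports a peak of 1. With more
threads the peak depends on scheduling and is bounded only by the thread
count. Limiting concurrency this way narrows the window in which two items
can overlap. It says nothing about whether the locking between them is
correct.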
> I'm still puzzled why Bob can't reproduce the things Bart has seen.
Is this necessary? The KASAN complaint that I reported should be more than
enough for someone who is familiar with the RXE driver to identify and fix
the root cause. I can help with testing candidate fixes.
Thanks,
Bart.