On Tue, Oct 10, 2023 at 02:29:19PM -0700, Bart Van Assche wrote:
> On 10/10/23 09:09, Jason Gunthorpe wrote:
> > On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> > >
> > > Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
> > > I see this is supported by Zhu, Bart and approved by Leon.
> > >
> > > Solution 2: Serializing execution of work items
> > > > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> > > > + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
> > >
> > > Solution 3: Merging requester and completer (not yet submitted/tested)
> > > https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@xxxxxxxxx/
> > > Not clear to me if we should call this a new feature or a fix.
> > > If it can eliminate the hang issue, it could be an ultimate solution.
> > >
> > > It is understandable some people do not want to wait for solution 3 to be submitted and verified.
> > > Is there any problem if we adopt solution 2?
> > > If so, then I agree to going with solution 1.
> > > If not, solution 2 is better to me.
> >
> > I also do not want to go backwards, I don't believe the locking is
> > magically correct under tasklets. 2 is painful enough to continue to
> > motivate people to fix this while unbreaking block tests.
>
> In my opinion (2) is not a solution. Zhu Yanjun reported test failures with
> rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably
> made it less likely to trigger any race conditions but I don't believe that
> this is sufficient as a solution.

I've been going on the assumption that rxe has always been full of
bugs. I don't believe the work queue change added new bugs, it just
made the existing bugs easier to hit.

It is hard to be sure until someone can find out what is going wrong.

If we revert it then rxe will probably just stop development
entirely. Daisuke's ODP work will be blocked and if Bob was able to
fix it he would have done so already. Which means Bob's ongoing work
is lost too.

I *vastly* prefer we root cause and fix it properly. Rxe was finally
starting to get a reasonable set of people interested in it, I do not
want to kill that off.

Again, I'm troubled that this doesn't seem to be reproducing for other
people.

> > I'm still puzzled why Bob can't reproduce the things Bart has seen.
>
> Is this necessary?

It is always easier to debug something you can change than to try and
guess what an oops is trying to say..

> The KASAN complaint that I reported should be more than enough for
> someone who is familiar with the RXE driver to identify and fix the
> root cause. I can help with testing candidate fixes.

Bob?

Jason