On 4/15/22 02:12, Yanjun Zhu wrote:
> On 2022/4/10 5:43, Bob Pearson wrote:
>> On 4/9/22 00:04, Christoph Hellwig wrote:
>>> On Fri, Apr 08, 2022 at 04:25:12PM -0700, Bart Van Assche wrote:
>>>> One of the functions in the above call stack is sd_remove(). sd_remove()
>>>> calls del_gendisk() just before calling sd_shutdown(). sd_shutdown()
>>>> submits the SYNCHRONIZE CACHE command. In del_gendisk() I found the
>>>> following comment: "Fail any new I/O". Do you agree that failing new I/O
>>>> before sd_shutdown() is called is wrong? Is there any other way to fix this
>>>> than moving the blk_queue_start_drain() etc. calls out of del_gendisk() and
>>>> into a new function?
>>>
>>> That SYNCHRONIZE CACHE is a passthrough command sent on the request_queue
>>> and should not be affected by stopping all file system I/O.
>>
>> When I run check -q srp
>> all the test cases pass but each one stops for 3+ minutes at synchronize cache.
>> The rxe device is still active until sync cache returns when the last QP and the PD
>> are destroyed. It may be that the queues are blocked waiting for something else
>> even though they have reported success??
>
> If you remove all the xarray patches and use the original source code, this will not occur.
>
> Zhu Yanjun
>

I know. I am trying to find out why. For performance reasons I very much
want to make the xarray + rcu_locking patches work correctly. All the
spinlock issues are now fixed in my tree. What is left is a race in the
RDMA read retry code somewhere. I'll find it. In the process of chasing
this I have found several real bugs and I suspect a few more are out there.

Bob