Re: blktest failures

Bob Pearson <rpearsonhpe@xxxxxxxxx> · Fri, 15 Apr 2022 02:26:50 -0500

On 4/15/22 02:12, Yanjun Zhu wrote:
> On 2022/4/10 5:43, Bob Pearson wrote:
>> On 4/9/22 00:04, Christoph Hellwig wrote:
>>> On Fri, Apr 08, 2022 at 04:25:12PM -0700, Bart Van Assche wrote:
>>>> One of the functions in the above call stack is sd_remove(). sd_remove()
>>>> calls del_gendisk() just before calling sd_shutdown(). sd_shutdown()
>>>> submits the SYNCHRONIZE CACHE command. In del_gendisk() I found the
>>>> following comment: "Fail any new I/O". Do you agree that failing new I/O
>>>> before sd_shutdown() is called is wrong? Is there any other way to fix this
>>>> than moving the blk_queue_start_drain() etc. calls out of del_gendisk() and
>>>> into a new function?
>>>
>>> That SYNCHRONIZE CACHE is a passthrough command sent on the request_queue
>>> and should not be affected by stopping all file system I/O.
>>
>> When I run check -q srp
>> all the test cases pass but each one stops for 3+ minutes at synchronize cache.
>> The rxe device is still active until sync cache returns when the last QP and the PD
>> are destroyed. It may be that the queues are blocked waiting for something else
>> even though they have reported success??
> 
> If you remove all the xarray patches and use the original source code, this will not occur.
> 
> Zhu Yanjun
> 

I know. I am trying to find out why. For performance reasons I very much want to
make the xarray + rcu_locking patches work correctly. All the spinlock issues are
now fixed in my tree. What is left is a race in the RDMA read retry code somewhere.
I'll find it. In the process of chasing this I have found several real bugs and
I suspect a few more are out there.

Bob