Re: [PATCH for-next v2 1/4] RDMA/erdma: Make the device probe process more robust

Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> · Mon, 2 Sep 2024 17:09:09 +0800

On 9/2/24 3:21 PM, Leon Romanovsky wrote:
> On Fri, Aug 30, 2024 at 10:34:42AM +0800, Cheng Xu wrote:
>>
>>
>> On 8/29/24 6:09 PM, Leon Romanovsky wrote:
>>> On Wed, Aug 28, 2024 at 02:09:41PM +0800, Cheng Xu wrote:
>>>> Driver may probe again while hardware is destroying the internal
>>>> resources allocated for previous probing
>>>
>>> How is it possible?
>>>
>>
>> The resources I mentioned are completely invisible to the driver; they are related
>> to our device management part in the hypervisor, so this won't leak host
>> resources, and the cleanup/reset process may take a long time. For these reasons,
>> we don't wait for the cleanup/reset to complete in the remove routine.
>> Instead, the driver waits for the device status to become ready in the probe routine
>> (in most cases, the hardware will have enough time to finish the cleanup/reset
>> before the second probe), so that we can speed up the remove process.
> 
> And why doesn't the hypervisor wait for the device to be ready before giving it to the VM?

The hypervisor actually does what you described during the first boot. However, in
one scenario the erdma driver is unloaded and reloaded quickly while the device
remains present in the VM. In this case, the hypervisor has no opportunity to
perform that action.

> Why do you need to complicate the probe routine to overcome the hypervisor behavior?
> 

The hardware currently requires that the previous reset (issued in the remove routine)
complete before device init (issued in the probe routine). Waiting for the reset to
complete in either the remove routine or the probe routine meets this requirement.
This patch chose to wait in the probe routine because that speeds up the remove process.
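
As a rough illustration of the ordering described above, here is a minimal
user-space sketch. It is not the erdma code: struct fake_dev, fake_remove(),
fake_probe() and the poll-count model of the reset are all hypothetical names
made up for this example.

```c
#include <stdbool.h>

/* Hypothetical device model (NOT the erdma register layout): the reset
 * issued at remove time completes after a number of status reads. */
struct fake_dev {
	int polls_until_ready;	/* status reads left before the reset is done */
};

/* Read the (simulated) device status once. */
static bool fake_dev_reset_done(struct fake_dev *dev)
{
	if (dev->polls_until_ready > 0) {
		dev->polls_until_ready--;
		return false;
	}
	return true;
}

/* Remove path: issue the reset and return immediately, without waiting. */
static void fake_remove(struct fake_dev *dev, int reset_cost)
{
	dev->polls_until_ready = reset_cost;
}

/* Probe path: wait (bounded) for the previous reset to complete before
 * issuing device init. Returns 0 on success, -1 on timeout. */
static int fake_probe(struct fake_dev *dev, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (fake_dev_reset_done(dev))
			return 0;	/* now safe to issue device init */
		/* a real driver would sleep between polls here */
	}
	return -1;
}
```

With this shape the remove path stays fast, and the probe path only pays the
wait when the reload happens before the reset has finished.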

Actually this is a good question, and it makes me think that the requirement in the
hardware/backend could perhaps be eliminated, which would simplify the driver.

I'd like to remove this patch in v3 and leave it for internal discussion.

Thanks very much
Cheng Xu

> Thanks
> 
>>
>> Thanks,
>> Cheng Xu
>>