Re: [PATCH for-next 0/3] RDMA/erdma: Support flushing all WRs after QP state changed to ERROR

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 24 Nov 2022 15:00:53 -0400



On Wed, Nov 16, 2022 at 10:31:04AM +0800, Cheng Xu wrote:
> Hi,
> 
> This series adds support for flushing all WRs posted to hardware after
> the QP state has changed to ERROR.
> 
> Old firmware may not flush newly posted WRs after the QP state has changed
> to ERROR, because it is difficult for firmware to get the real-time PI
> (producer index) of QPs, especially for the RQs.
> 
> Previously we tried to avoid this issue by implementing custom
> drain_{sq/rq} [1], but this has a flaw, as Tom and Jason pointed out, which
> we have also hit in some scenarios, for example, NoF fatal recovery.
> 
> So, we introduce a new mechanism to fix this. When registering the ibdev,
> we create a workqueue for reflushing (we name it "reflush" because the
> hardware has already started flushing the QPs by that time, and the work
> lets the hardware flush newly posted WRs). Once a QP needs to flush WRs,
> or new WRs are posted after flushing has started, we queue a delayed work
> on the workqueue, or modify its delay if it is already queued. In the
> work, the driver notifies the latest PIs to firmware through the CMDQ, so
> that firmware can flush all the newly posted WRs. This applies to kernel
> QPs first.
> 
> - #1 adds a workqueue for WRs reflushing.
> - #2 adds a reflushing work for each QP.
> - #3 notifies the latest PIs to firmware for reflushing.
> 
> [1] https://lore.kernel.org/all/20220824094251.23190-3-chengyou@xxxxxxxxxxxxxxxxx/t/
> 
> Thanks,
> Cheng Xu
> 
> Cheng Xu (3):
>   RDMA/erdma: Add a workqueue for WRs reflushing
>   RDMA/erdma: Implement the lifecycle of reflushing work for each QP
>   RDMA/erdma: Notify the latest PI to FW for reflushing when necessary

Applied to for-next, thanks

Jason
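
For readers following the thread, the reflush scheme described in the cover
letter can be modelled roughly as below. This is only a userspace sketch under
stated assumptions, not the erdma code: every name (model_qp, fw_view,
schedule_reflush, REFLUSH_DELAY, and so on) is hypothetical, time is a plain
counter, and the "firmware notification" is just a copy of the producer
indexes. In the kernel, mod_delayed_work() offers comparable "queue, or
re-arm if already pending" semantics, but whether the series uses it is not
stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the reflush idea; not driver code. */
enum qp_state { QP_RTS, QP_ERROR };

/* What the modelled "firmware" last heard from the driver. */
struct fw_view {
	uint32_t sq_pi;
	uint32_t rq_pi;
	unsigned int notify_count;
};

struct model_qp {
	enum qp_state state;
	uint32_t sq_pi;          /* driver-side send queue producer index */
	uint32_t rq_pi;          /* driver-side receive queue producer index */
	bool work_pending;       /* is a delayed reflush work armed? */
	uint64_t work_deadline;  /* simulated time at which the work may run */
	struct fw_view fw;
};

#define REFLUSH_DELAY 5 /* assumed delay, in simulated time units */

/* Arm the delayed work, or push its deadline out if it is already armed. */
static void schedule_reflush(struct model_qp *qp, uint64_t now)
{
	assert(qp->state == QP_ERROR);
	qp->work_pending = true;
	qp->work_deadline = now + REFLUSH_DELAY;
}

/* Moving the QP to ERROR is the first point at which a flush is needed. */
static void qp_to_error(struct model_qp *qp, uint64_t now)
{
	qp->state = QP_ERROR;
	schedule_reflush(qp, now);
}

/* WRs posted after the QP entered ERROR re-arm the reflush work. */
static void post_send(struct model_qp *qp, uint64_t now)
{
	qp->sq_pi++;
	if (qp->state == QP_ERROR)
		schedule_reflush(qp, now);
}

static void post_recv(struct model_qp *qp, uint64_t now)
{
	qp->rq_pi++;
	if (qp->state == QP_ERROR)
		schedule_reflush(qp, now);
}

/*
 * Run the work if its deadline has passed. The work body reports the
 * latest PIs to the modelled firmware. Returns true if it ran.
 */
static bool run_timers(struct model_qp *qp, uint64_t now)
{
	if (!qp->work_pending || now < qp->work_deadline)
		return false;

	qp->work_pending = false;
	qp->fw.sq_pi = qp->sq_pi;
	qp->fw.rq_pi = qp->rq_pi;
	qp->fw.notify_count++;
	return true;
}
```

In this model, a burst of WRs posted after the transition to ERROR results in
a single notification carrying the final PIs, because each post only moves the
deadline of the already armed work.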