On 2019/1/16 6:03, Jason Gunthorpe wrote: > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: >> Hi Jason, >> >> On 2019/1/12 5:25, Jason Gunthorpe wrote: >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: >>>> AEQ overflow will be reported by hardware when too many >>>> asynchronous events occurred but not be handled in time. >>>> Normally, AEQ overflow error is not easy to occur. Once >>>> happened, we have to do physical function reset to recover. >>>> PF reset is implemented in two steps. Firstly, set reset >>>> level with ae_dev->ops->set_default_reset_request. >>>> Secondly, run reset with ae_dev->ops->reset_event. >>>> >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx> >>>> Signed-off-by: Yixian Liu <liuyixian@xxxxxxxxxx> >>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ >>>> 1 file changed, 11 insertions(+) >>> >>> Why should this be a -rc patch? It doesn't look like it meets the >>> requires for a stable kernel, or fixing something introduced in the >>> merge window. >>> >>> This looks like a new feature to me. >> >> I think we could take this as a bug. Because the device can't >> continue working, if we don't handle the AEQ overflow error. And the >> code is simple and doesn't bring any unstable factor. Then i take >> it as -rc patch. > > recovery from hardware failures is a new feature. > > If this is fixing a bug in the existing recovery, or is a major > problem that many users are hitting, then explain in the commit > comment. > Hi Jason, Thank you very much. How about change the commit log as following? RDMA/hns: fix the issue of handling AEQ overflow for hip08 AEQ overflow will be reported by hardware when too many asynchronous events occurred but not be handled in time. Currently, we just do the print when AEQ overflow happened. This is not right. We should also do physical function reset, otherwise, the device can't continue working. PF reset is implemented in two steps. Firstly, set reset level with ae_dev->ops->set_default_reset_request. Secondly, run reset with ae_dev->ops->reset_event. Signed-off-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx> Signed-off-by: Yixian Liu <liuyixian@xxxxxxxxxx> > Jason > > -- thanks tanxiaofei