Hi Jason, On 2019/1/19 3:58, Jason Gunthorpe wrote: > On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote: >> >> On 2019/1/16 6:03, Jason Gunthorpe wrote: >>> On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: >>>> Hi Jason, >>>> >>>> On 2019/1/12 5:25, Jason Gunthorpe wrote: >>>>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: >>>>>> AEQ overflow will be reported by hardware when too many >>>>>> asynchronous events occurred but not be handled in time. >>>>>> Normally, AEQ overflow error is not easy to occur. Once >>>>>> happened, we have to do physical function reset to recover. >>>>>> PF reset is implemented in two steps. Firstly, set reset >>>>>> level with ae_dev->ops->set_default_reset_request. >>>>>> Secondly, run reset with ae_dev->ops->reset_event. >>>>>> >>>>>> Signed-off-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx> >>>>>> Signed-off-by: Yixian Liu <liuyixian@xxxxxxxxxx> >>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ >>>>>> 1 file changed, 11 insertions(+) >>>>> >>>>> Why should this be a -rc patch? It doesn't look like it meets the >>>>> requires for a stable kernel, or fixing something introduced in the >>>>> merge window. >>>>> >>>>> This looks like a new feature to me. >>>> >>>> I think we could take this as a bug. Because the device can't >>>> continue working, if we don't handle the AEQ overflow error. And the >>>> code is simple and doesn't bring any unstable factor. Then i take >>>> it as -rc patch. >>> >>> recovery from hardware failures is a new feature. >>> >>> If this is fixing a bug in the existing recovery, or is a major >>> problem that many users are hitting, then explain in the commit >>> comment. >>> >> Hi Jason, >> >> Thank you very much. >> How about change the commit log as following? >> >> RDMA/hns: fix the issue of handling AEQ overflow for hip08 >> >> AEQ overflow will be reported by hardware when too many >> asynchronous events occurred but not be handled in time. >> >> Currently, we just do the print when AEQ overflow happened. >> This is not right. We should also do physical function reset, >> otherwise, the device can't continue working. >> >> PF reset is implemented in two steps. Firstly, set reset >> level with ae_dev->ops->set_default_reset_request. >> Secondly, run reset with ae_dev->ops->reset_event. > > This still doesn't explain why this is for -rc. > > New features are not for the -rc tree. > > You'd need to explain how/why that this is a widespread a critical > problem in your user base > Maybe this patch is not suitable for the -rc branch. Hmm, I will resend this patch based on -next. Many thanks. > Jason > > . > -- thanks tanxiaofei