On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote: > > On 2019/1/16 6:03, Jason Gunthorpe wrote: > > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: > >> Hi Jason, > >> > >> On 2019/1/12 5:25, Jason Gunthorpe wrote: > >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: > >>>> AEQ overflow will be reported by hardware when too many > >>>> asynchronous events occurred but not be handled in time. > >>>> Normally, AEQ overflow error is not easy to occur. Once > >>>> happened, we have to do physical function reset to recover. > >>>> PF reset is implemented in two steps. Firstly, set reset > >>>> level with ae_dev->ops->set_default_reset_request. > >>>> Secondly, run reset with ae_dev->ops->reset_event. > >>>> > >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx> > >>>> Signed-off-by: Yixian Liu <liuyixian@xxxxxxxxxx> > >>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ > >>>> 1 file changed, 11 insertions(+) > >>> > >>> Why should this be a -rc patch? It doesn't look like it meets the > >>> requires for a stable kernel, or fixing something introduced in the > >>> merge window. > >>> > >>> This looks like a new feature to me. > >> > >> I think we could take this as a bug. Because the device can't > >> continue working, if we don't handle the AEQ overflow error. And the > >> code is simple and doesn't bring any unstable factor. Then i take > >> it as -rc patch. > > > > recovery from hardware failures is a new feature. > > > > If this is fixing a bug in the existing recovery, or is a major > > problem that many users are hitting, then explain in the commit > > comment. > > > Hi Jason, > > Thank you very much. > How about change the commit log as following? > > RDMA/hns: fix the issue of handling AEQ overflow for hip08 > > AEQ overflow will be reported by hardware when too many > asynchronous events occurred but not be handled in time. > > Currently, we just do the print when AEQ overflow happened. > This is not right. We should also do physical function reset, > otherwise, the device can't continue working. > > PF reset is implemented in two steps. Firstly, set reset > level with ae_dev->ops->set_default_reset_request. > Secondly, run reset with ae_dev->ops->reset_event. This still doesn't explain why this is for -rc. New features are not for the -rc tree. You'd need to explain how/why that this is a widespread a critical problem in your user base Jason