Re: [PATCH resend v3 rdma-rc 1/1] RDMA/hns: Add the process of AEQ overflow for hip08

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote:
> 
> On 2019/1/16 6:03, Jason Gunthorpe wrote:
> > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote:
> >> Hi Jason,
> >>
> >> On 2019/1/12 5:25, Jason Gunthorpe wrote:
> >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
> >>>> AEQ overflow will be reported by hardware when too many
> >>>> asynchronous events occurred but not be handled in time.
> >>>> Normally, AEQ overflow error is not easy to occur. Once
> >>>> happened, we have to do physical function reset to recover.
> >>>> PF reset is implemented in two steps. Firstly, set reset
> >>>> level with ae_dev->ops->set_default_reset_request.
> >>>> Secondly, run reset with ae_dev->ops->reset_event.
> >>>>
> >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx>
> >>>> Signed-off-by: Yixian Liu <liuyixian@xxxxxxxxxx>
> >>>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
> >>>>  1 file changed, 11 insertions(+)
> >>>
> >>> Why should this be a -rc patch? It doesn't look like it meets the
> >>> requires for a stable kernel, or fixing something introduced in the
> >>> merge window.
> >>>
> >>> This looks like a new feature to me.
> >>
> >> I think we could take this as a bug. Because the device can't
> >> continue working, if we don't handle the AEQ overflow error. And the
> >> code is simple and doesn't bring any unstable factor.  Then i take
> >> it as -rc patch.
> > 
> > recovery from hardware failures is a new feature.
> > 
> > If this is fixing a bug in the existing recovery, or is a major
> > problem that many users are hitting, then explain in the commit
> > comment.
> > 
> Hi Jason,
> 
> Thank you very much.
> How about change the commit log as following?
> 
> RDMA/hns: fix the issue of handling AEQ overflow for hip08
> 
> AEQ overflow will be reported by hardware when too many
> asynchronous events occurred but not be handled in time.
> 
> Currently, we just do the print when AEQ overflow happened.
> This is not right. We should also do physical function reset,
> otherwise, the device can't continue working.
> 
> PF reset is implemented in two steps. Firstly, set reset
> level with ae_dev->ops->set_default_reset_request.
> Secondly, run reset with ae_dev->ops->reset_event.

This still doesn't explain why this is for -rc.

New features are not for the -rc tree.

You'd need to explain how/why that this is a widespread a critical
problem in your user base

Jason



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux