RE: [bug report] blktests srp/002 hang

On Sun, Sep 24, 2023 10:18 AM Rain River wrote:
> On Sat, Sep 23, 2023 at 2:14 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
> >
> > On 9/21/23 10:10, Zhu Yanjun wrote:
> > >
> > > > On 2023/9/21 22:39, Bob Pearson wrote:
> > >> On 9/21/23 09:23, Rain River wrote:
> > >>> On Thu, Sep 21, 2023 at 2:53 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
> > >>>> On 9/20/23 12:22, Bart Van Assche wrote:
> > >>>>> On 9/20/23 10:18, Bob Pearson wrote:
> > >>>>>> But I have also seen the same behavior in the siw driver which is
> > >>>>>> completely independent.
> > >>>>> Hmm ... I haven't seen any hangs yet with the siw driver.
> > >>>> I was on Ubuntu 6-9 months ago. Currently I don't see hangs on either.
> > >>>>>> As mentioned above at the moment Ubuntu is failing rarely. But it used to fail reliably (srp/002 about 75% of
> > >>>>>> the time and srp/011 about 99% of the time.) There haven't been any changes to rxe to explain this.
> > >>>>> I think that Zhu mentioned commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
> > >>>>> support for rxe tasks")?
> > >>>> That change happened well before the failures went away. I was seeing failures at the same rate with tasklets
> > >>>> and wqs. But after updating Ubuntu and the kernel at some point they all went away.
> > >>> I made tests on the latest Ubuntu with the latest kernel without the
> > >>> commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
> > >>> The latest kernel is v6.6-rc2, the commit 9b4b7c1f9f54 ("RDMA/rxe: Add
> > >>> workqueue support for rxe tasks") is reverted.
> > >>> I ran blktests about 30 times and this problem did not occur.
> > >>>
> > >>> So I confirm that this hang problem does not occur on Ubuntu
> > >>> without the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
> > >>> support for rxe tasks").
> > >>>
> > >>> Nanthan
> > >>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Bart.
> > >>>>
> > >> This commit is very important for several reasons. It is needed for the ODP implementation
> > >> that is in the works from Daisuke Matsuda and also for QP scaling of performance. The work
> > >> queue implementation scales well with increasing qp number while the tasklet implementation
> > >> does not. This is critical for the driver's use in large scale storage applications. So, if
> > >> there is a bug in the work queue implementation it needs to be fixed, not reverted.
> > >>
> > >> I am still hoping that someone will diagnose what is causing the ULPs to hang in terms of
> > >> something missing causing it to wait.
> > >
> > > Hi, Bob
> > >
> > >
> > > You submitted this commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
> > >
> > > You should be very familiar with this commit.
> > >
> > > And this commit causes a regression.
> > >
> > > So you should delve into the source code to find the root cause, then fix it.
> >
> > Zhu,
> >
> > I have spent tons of time over the months trying to figure out what is happening with blktests.
> > As I have mentioned several times I have seen the same exact failure in siw in the past although
> > currently that doesn't seem to happen so I had been suspecting that the problem may be in the ULP.
> > The challenge is that the blktests represents a huge stack of software much of which I am not
> > familiar with. The bug is a hang in layers above the rxe driver and so far no one has been able to
> > say with any specificity that the rxe driver failed to do something needed to make progress or violated
> > expected behavior. Without any clue as to where to look it has been hard to make progress.
> 
> Bob
> 
> Work queue will sleep. If the work queue sleeps for a long time, the packets
> will not be sent to the ULP. This is why this hang occurs.

In general a work queue can sleep, but the workload running in the rxe driver
should not sleep because it originally ran in tasklets and was converted
to use a work queue. A task can sometimes take longer because of IRQs, but
the same thing can also happen with tasklets. If there is a difference between
the two, I think it would be the overhead of scheduling the work queue.

> It is difficult to handle this sleep in the work queue. It would be
> better to revert this commit in RXE.

I object to reverting the commit at this stage. As Bob wrote above,
nobody has found any logical failure in the rxe driver. It is quite possible
that the patch is just revealing a latent bug in the higher layers.

> Because the work queue sleeps, the ULP cannot wait a long time for the
> packets. If packets cannot reach the ULPs for a long time, many problems
> will occur in the ULPs.

I wonder where in the rxe driver it sleeps. BTW, most packets are
processed in NET_RX_IRQ context, and the work queue is scheduled only
when there is already a running context. If your speculation is correct,
the hang should occur more frequently if we change the driver to use the
work queue exclusively. My ODP patches include a change to do this.
Cf. https://lore.kernel.org/lkml/7699a90bc4af10c33c0a46ef6330ed4bb7e7ace6.1694153251.git.matsuda-daisuke@xxxxxxxxxxx/
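
To make the dispatch rule above concrete, here is a minimal userspace
sketch in C. It is only a model under stated assumptions, not rxe code:
the names task_model, choose_dispatch, submit and the wq_only flag are
hypothetical and do not correspond to any function or field in the driver.
It shows that work runs inline in the caller's context unless a context is
already running the task, and that a "work queue only" mode defers
everything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the dispatch decision; not taken from rxe. */
enum dispatch { RUN_INLINE, DEFER_TO_WORKQUEUE };

struct task_model {
    bool busy;          /* a context is already processing this task */
    int inline_runs;    /* work handled directly in the caller's context */
    int deferred_runs;  /* work handed to the work queue */
};

/*
 * Decide where a new piece of work goes.
 * wq_only models the assumed "use the work queue exclusively" mode.
 */
static enum dispatch choose_dispatch(const struct task_model *t, bool wq_only)
{
    if (wq_only || t->busy)
        return DEFER_TO_WORKQUEUE;
    return RUN_INLINE;
}

/* Record the decision so the two paths can be counted. */
static enum dispatch submit(struct task_model *t, bool wq_only)
{
    enum dispatch d;

    assert(t != NULL);
    d = choose_dispatch(t, wq_only);
    if (d == RUN_INLINE)
        t->inline_runs++;
    else
        t->deferred_runs++;
    return d;
}
```

In this model, with wq_only set to false, only submissions that arrive
while busy is true are deferred. With wq_only set to true every submission
is deferred, so if work queue latency were the cause, that mode would be
expected to expose the hang more often.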

Thanks,
Daisuke

> 
> >
> > My main motivation is making Lustre run on rxe and it does and it's fast enough to meet our needs.
> > Lustre is similar to srp as a ULP and in all of our testing we have never seen a similar hang. Other
> > hangs to be sure but not this one. I believe that this bug will never get resolved until someone with
> > a good understanding of the ULP drivers makes an effort to find out where and why the hang is occurring.
> > From there it should be straightforward to fix the problem. I am continuing to investigate and am learning
> > the device-manager/multipath/srp/scsi stack but I have a long way to go.
> >
> > Bob
> >
> >
> > >
> > >
> > > Jason && Leon, please comment on this.
> > >
> > >
> > > Best Regards,
> > >
> > > Zhu Yanjun
> > >
> > >>
> > >> Bob
> >



