Re: [bug report] blktests srp/002 hang

On Mon, Sep 25, 2023 11:31 PM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
> On 2023/9/25 12:47, Daisuke Matsuda (Fujitsu) wrote:
> > On Sun, Sep 24, 2023 10:18 AM Rain River wrote:
> >> On Sat, Sep 23, 2023 at 2:14 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
> >>> On 9/21/23 10:10, Zhu Yanjun wrote:
> >>>> On 2023/9/21 22:39, Bob Pearson wrote:
> >>>>> On 9/21/23 09:23, Rain River wrote:
> >>>>>> On Thu, Sep 21, 2023 at 2:53 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
> >>>>>>> On 9/20/23 12:22, Bart Van Assche wrote:
> >>>>>>>> On 9/20/23 10:18, Bob Pearson wrote:
> >>>>>>>>> But I have also seen the same behavior in the siw driver which is
> >>>>>>>>> completely independent.
> >>>>>>>> Hmm ... I haven't seen any hangs yet with the siw driver.
> >>>>>>> I was on Ubuntu 6-9 months ago. Currently I don't see hangs on either.
> >>>>>>>>> As mentioned above at the moment Ubuntu is failing rarely. But it used to fail reliably (srp/002 about 75% of
> >>>>>>>>> the time and srp/011 about 99% of the time.) There haven't been any changes to rxe to explain this.
> >>>>>>>> I think that Zhu mentioned commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
> >>>>>>>> support for rxe tasks")?
> >>>>>>> That change happened well before the failures went away. I was seeing failures at the same rate with tasklets
> >>>>>>> and wqs. But after updating Ubuntu and the kernel at some point they all went away.
> >>>>>> I tested the latest Ubuntu with the latest kernel, v6.6-rc2, with
> >>>>>> commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
> >>>>>> reverted.
> >>>>>> I ran blktests about 30 times and this problem did not occur.
> >>>>>>
> >>>>>> So I confirm that this hang problem does not occur on Ubuntu
> >>>>>> without the commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support
> >>>>>> for rxe tasks").
> >>>>>>
> >>>>>> Nanthan
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Bart.
> >>>>> This commit is very important for several reasons. It is needed for the ODP implementation
> >>>>> that is in the works from Daisuke Matsuda and also for QP scaling of performance. The work
> >>>>> queue implementation scales well with increasing qp number while the tasklet implementation
> >>>>> does not. This is critical for the driver's use in large scale storage applications. So, if
> >>>>> there is a bug in the work queue implementation it needs to be fixed, not reverted.
> >>>>>
> >>>>> I am still hoping that someone will diagnose what is causing the ULPs to hang in terms of
> >>>>> something missing causing it to wait.
> >>>> Hi, Bob
> >>>>
> >>>>
> >>>> You submitted this commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
> >>>>
> >>>> You should be very familiar with this commit.
> >>>>
> >>>> And this commit causes regression.
> >>>>
> >>>> So you should delve into the source code to find the root cause, then fix it.
> >>> Zhu,
> >>>
> >>> I have spent tons of time over the months trying to figure out what is happening with blktests.
> >>> As I have mentioned several times I have seen the same exact failure in siw in the past although
> >>> currently that doesn't seem to happen so I had been suspecting that the problem may be in the ULP.
> >>> The challenge is that the blktests represents a huge stack of software much of which I am not
> >>> familiar with. The bug is a hang in layers above the rxe driver and so far no one has been able to
> >>> say with any specificity that the rxe driver failed to do something needed to make progress or violated
> >>> expected behavior. Without any clue as to where to look it has been hard to make progress.
> >> Bob
> >>
> >> A work queue can sleep. If the work queue sleeps for a long time, the
> >> packets will not be delivered to the ULP. This is why this hang occurs.
> > In general work queue can sleep, but the workload running in rxe driver
> > should not sleep because it was originally running on tasklet and converted
> > to use work queue. A task can sometimes take longer because of IRQs, but
> > the same thing can also happen with tasklet. If there is a difference between
> > the two, I think it would be the overhead of scheduling the work queue.
> >
> >> It is difficult to handle this sleep in the work queue. It would be
> >> better to revert this commit in RXE.
> > I object to reverting the commit at this stage. As Bob wrote above,
> > nobody has found any logical failure in rxe driver. It is quite possible
> > that the patch is just revealing a latent bug in the higher layers.
> 
> So far, on Debian and Fedora, all the tests with the work queue hang.
> After reverting this commit, no hang occurs.
> 
> Until there are new test results, it is reasonable to suspect that this
> commit causes the hang.

If the hang *always* occurs, then I agree your opinion is correct,
but this one happens occasionally. It is also natural to think that
the commit makes it easier to meet the condition of an existing bug.

> 
> >
> >> Because the work queue sleeps, the ULP cannot wait a long time for the
> >> packets. If packets cannot reach the ULPs for a long time, many problems
> >> will occur in the ULPs.
> > I wonder where in the rxe driver it sleeps. BTW, most packets are
> > processed in NET_RX_IRQ context, and work queue is scheduled only
> 
> Do you mean NET_RX_SOFTIRQ?

Yes. I am sorry for confusing you.

Thanks,
Daisuke

> 
> Zhu Yanjun
> 
> > when there is already a running context. If your speculation is correct,
> > the hang will occur more frequently if we change it to use work queue exclusively.
> > My ODP patches include a change to do this.
> > Cf.
> > https://lore.kernel.org/lkml/7699a90bc4af10c33c0a46ef6330ed4bb7e7ace6.1694153251.git.matsuda-daisuke@fujitsu.com/
> >
> > Thanks,
> > Daisuke
> >
> >>> My main motivation is making Lustre run on rxe and it does and it's fast enough to meet our needs.
> >>> Lustre is similar to srp as a ULP and in all of our testing we have never seen a similar hang. Other
> >>> hangs to be sure but not this one. I believe that this bug will never get resolved until someone with
> >>> a good understanding of the ULP drivers makes an effort to find out where and why the hang is occurring.
> >>> From there it should be straightforward to fix the problem. I am continuing to investigate and am learning
> >>> the device-manager/multipath/srp/scsi stack but I have a long ways to go.
> >>>
> >>> Bob
> >>>
> >>>
> >>>>
> >>>> Jason && Leon, please comment on this.
> >>>>
> >>>>
> >>>> Best Regards,
> >>>>
> >>>> Zhu Yanjun
> >>>>
> >>>>> Bob



