RE: [PATCH/RFC] net: ethernet: ravb: Try to wake subqueue instead of stop on timeout

Yoshihiro Shimoda <yoshihiro.shimoda.uh@xxxxxxxxxxx> · Mon, 20 Jul 2020 01:48:33 +0000

Hello!

> From: Sergei Shtylyov, Sent: Monday, July 20, 2020 4:20 AM
> 
> Hello!
> 
>    Sorry about another late reply, was having h/w issues on the new work...

No problem! :) Thank you for your reply!

> On 07/06/2020 12:25 PM, Yoshihiro Shimoda wrote:
<snip>
> >>     Maybe we should just fix those blind assumptions?
> >
> > Maybe I should have described some facts instead of assumptions like below?
> > If so, I should modify the code too.
> >
> > After ravb_stop_dma() was called, the driver assumed any transfers were
> > stopped. However, the current ravb_tx_timeout_work() doesn't check whether
> > the ravb_stop_dma() is succeeded without any error or not. So, we should
> > fix it.
> 
>    Yes. Better a stuck TX queue (with a chance to recover) than kernel oops...

I got it.

<snip>
> >>     Well, I was thinking of polling TCCR and CSR like the current
> >> ravb_stop_dma() does, but if that works...
> >
> > I'm not sure whether polling TCCR and CSR is enough or not.
> > Instead of polling those registers, maybe we should poll whether
> > ravb_stop_dma() is succeeded or not?
> 
>    Yes, if by polling you mean just checking the result of it. :-)

Yes, I intend to check the result of it :)

> > Especially, result of ravb_config() is
> > a key point whether the hardware is really stopped or not.
> > So, I'm thinking that just polling the ravb_stop_dma() in
> > ravb_tx_timeout_work() is better than the per-queue tear-down and
> > re-init now. But, what do you think?
> 
>    I don't think it's better since we're now supposed to handle a per-queue
> TX timeout (still not sure it's possible with this h/w). But of course, it's
> better as it's simple enough for a bug fix.

Thank you for your comment. Yes, I agree it's better to handle a per-queue TX timeout.
However, I think we need refactoring for it. So, I'd like to fix a bug by simple code.

> >>> < Prepare new descriptors and start again >
> >>> 4. Prepare new descriptors.
> >>
> >>     That's where the cause for using the workqueue lies -- the descriptors are
> >> allocated with GFP_KERNEL, not GFP_ATOMIC...
> >
> > IIUC, we can avoid to use the workqueue if re-allocation is not really necessary.
> >
> >> if you have time/desire to
> >> untangle all this, I'd appreciate it; else I'd have to work on this in my
> >> copious free time... :-)
> >
> > If we don't need refactoring, I think I can do it :)
> 
>    Let's go forward with the simple fix (assuming it fixes the original oops).

Thank you for your comment! I'll do that.

Best regards,
Yoshihiro Shimoda

> [...]
> 
> MBR, Sergei