Hello!
On 30.06.2020 8:22, Yoshihiro Shimoda wrote:
From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM
According to the report of [1], this driver is possible to cause
the following error in ravb_tx_timeout_work().
ravb e6800000.ethernet ethernet: failed to switch device to config mode
This error means that the hardware could not change the state
from "Operation" to "Configuration" while some tx queue is operating.
After that, ravb_config() in ravb_dmac_init() will fail, and then
any descriptors will be not allocaled anymore so that NULL porinter
Pointer. :-)
dereference happens after that on ravb_start_xmit().
Such a case is possible to be caused because this driver supports
two queues (NC and BE) and the ravb_stop_dma() is possible to return
without any stopping process if TCCR or CSR register indicates
the hardware is operating for TX.
Maybe we should just fix those blind assumptions?
To fix the issue, just try to wake the subqueue on
ravb_tx_timeout_work() if the descriptors are not full instead
of stop all transfers (all queues of TX and RX).
[1]
https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@xxxxxxxxxxxx/
Reported-by: Dirk Behme <dirk.behme@xxxxxxxxxxxx>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@xxxxxxxxxxx>
---
I'm guessing that this issue is possible to happen if:
- ravb_start_xmit() calls netif_stop_subqueue(), and
- ravb_poll() will not be called with some reason, and
- netif_wake_subqueue() will be not called, and then
- dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout().
However, unfortunately, I didn't reproduce the issue yet.
To be honest, I'm also guessing other queues (SR) of this hardware
which out-of tree driver manages are possible to reproduce this issue,
but I didn't try such environment for now...
So, I marked RFC on this patch now.
I'm afraid, but do you have any comments about this patch?
I agree that we should now reset only the stuck queue, not both but I
doubt your solution is good enough. Let me have another look...
Thank you for your comment! I hope this solution is good enough...
I'm sorry again and again. But, do you have any time to look this patch?
Yes, in the sense of reviewing -- I don't consider it complete. And no, in
the sense of looking into the issue myself... Can we do a per-queue tear-down
and re-init (not necessarily all in 1 patch)?
In fact, it would ensue many changes...
Thank you for your comment! I'm not sure this "re-init" mean. But, we can do
Well, I meant the ring re-allocation and re-formatting... but (looking at
sh_eth) the former is not really necessary, it's enough to just stop the TX
ring and then re-format it and re-start... Well, unfortunately, the way I
structured the code, we can't do *just* that...
a per-queue tear-down if DMAC is still working. And, we can prepare new descriptors
for the queue after tear-down.
< Tear-down >
1. Set DT_EOS to the desc_bat[queue].
2. Set DLR.LBAx to 1.
3. Check if DLA.LBAx is cleared.
DLR.LBAx, you mean?
Well, I was thinking of polling TCCR and CSR like the current
ravb_stop_dma() does, but if that works...
< Prepare new descriptors and start again >
4. Prepare new descriptors.
That's where the cause for using the workqueue lies -- the descriptors are
allocated with GFP_KERNEL, not GFP_ATOMIC... if you have time/desire to
untangle all this, I'd appreciate it; else I'd have to work on this in my
copious free time... :-)
5. Set DT_LINKFIX to the desc_bat[queue].
6. Set DLR.LBAx to 1.
7. Check if DLA.LBAx is cleared.
I'm thinking doing tear-down and "re-init" of the descriptors is better
than just try to wake the subqueue. But, what do you think?
Definitely better, yes...
Best regards,
Yoshihiro Shimoda
MBR, Sergei