Re: m_can: a lot of 'Rx FIFO 0 Message Lost' in dmesg

Mariusz Madej <mariusz.madej@xxxxxxxxxx> · Sat, 27 Feb 2021 05:03:14 +0100

On 2/26/21 2:37 PM, Torin Cooper-Bennun wrote

The only place in m_can.c file, where interrupt register is cleared is function
called when interrupt arrives

static irqreturn_t m_can_isr(int irq, void *dev_id)
{
.
.
         /* ACK all irqs */
         if (ir & IR_ALL_INT)
                 m_can_write(cdev, M_CAN_IR, ir);
.
.
}

But when we enter 'NAPI mode' in heavy load we are never get to this function
until load gets lower and interrupts are enabled again. In this situation,
this code:
The m_can driver handles the IRQ by offloading the RX to a NAPI queue,
so the RX procedure is deferred, and is scheduled to happen at a
(slightly) later time. As far as I understand it, interrupts are not
disabled at any point.

Interupts are disabled in m_can_isr function:

if ((ir & IR_RF0N) || (ir & IR_ERR_ALL_30X)) {
        cdev->irqstatus = ir;
        m_can_disable_all_interrupts(cdev);    <--------HERE
        if (!cdev->is_peripheral)
                napi_schedule(&cdev->napi);
        else
                m_can_rx_peripheral(dev);

}

and they are enabled conditionaly in function:

static int m_can_poll(struct napi_struct *napi, int quota)
{
        struct net_device *dev = napi->dev;
        struct m_can_classdev *cdev = netdev_priv(dev);
        int work_done;

        work_done = m_can_rx_handler(dev, quota);
        if (work_done < quota) {
                napi_complete_done(napi, work_done);
                m_can_enable_all_interrupts(cdev); <---- HERE
        }

        return work_done;
}

so if work_done==quota(64) napi will schedule next receiving instead
of enabling interrupts. That is why i wrote that in my condition i dont get
to m_can_isr function and message lost interrupt is not cleared. As a result
my device enters to this function:

static int m_can_do_rx_poll(struct net_device *dev, int quota)
{
        struct m_can_classdev *cdev = netdev_priv(dev);
        u32 pkts = 0;
        u32 rxfs;

        rxfs = m_can_read(cdev, M_CAN_RXF0S);
        if (!(rxfs & RXFS_FFL_MASK)) {
                netdev_dbg(dev, "no messages in fifo0\n");
                return 0;
        }

        while ((rxfs & RXFS_FFL_MASK) && (quota > 0)) {
                if (rxfs & RXFS_RFL)
                        netdev_warn(dev, "Rx FIFO 0 Message Lost\n");

                m_can_read_fifo(dev, rxfs);

                quota--;
                pkts++;
                rxfs = m_can_read(cdev, M_CAN_RXF0S);
        }

        if (pkts)
                can_led_event(dev, CAN_LED_EVENT_RX);

        return pkts;
}

With RXFS_RFL==true and 64 messages to be read, that is why i have 64 warnings
in a row.
Those warnings take cpu time, and in this time fifo is full again so
function m_can_poll does not enable interrupts again, and so on...

That is why we got so many messages in a row for so long time. So clearing
RXFS_RFL bit after warning is issued could be a solution.
RXFS_RFL is a flag in a status register, not an interrupt flag. There is
a corresponding interrupt flag, but that is cleared along with the rest,
at the top of m_can_isr.

I agree, sorry for not being specific, the problem is cpu can not get into
m_can_isr for a long time in my case.

I think you are losing messages because the traffic is too heavy for
your system to read out the messages fast enough. That is the usual
reason for seeing "Rx FIFO 0 Message Lost".

Seeing "Rx FIFO 0 Message Lost" is not my biggest problem. The problem is
my system is not responsive along this messages.

I changed m_can_do_rx_poll:

static int m_can_do_rx_poll(struct net_device *dev, int quota)
{
        struct m_can_classdev *cdev = netdev_priv(dev);
        u32 pkts = 0;
        u32 rxfs;

        rxfs = m_can_read(cdev, M_CAN_RXF0S);
        if (!(rxfs & RXFS_FFL_MASK)) {
                netdev_dbg(dev, "no messages in fifo0\n");
                return 0;
        }

        while ((rxfs & RXFS_FFL_MASK) && (quota > 0)) {
                if (rxfs & RXFS_RFL) {
                        netdev_warn(dev, "Rx FIFO 0 Message Lost\n");
                        m_can_write(cdev, M_CAN_IR, IR_RF0L);
                }

                m_can_read_fifo(dev, rxfs);

                quota--;
                pkts++;
                rxfs = m_can_read(cdev, M_CAN_RXF0S);
        }

        if (pkts)
                can_led_event(dev, CAN_LED_EVENT_RX);

        return pkts;
}

And now my system is responsive - i sometimes get "Rx FIFO 0 Message Lost"
but one at a time - not 100k and this is not a big problem for me.
CAN works OK
So IMO it is a bug.

--
Regards,

Torin Cooper-Bennun
Software Engineer  | maxiluxsystems.com

Regards,

Mariusz