On 10.01.2023 15:53:22, Marc Kleine-Budde wrote: > On 10.01.2023 12:50:18, Thomas.Kopp@xxxxxxxxxxxxx wrote: > > > For the last case, I attached debugger after the fail an tried to read from the > > > socket until read returned -1: > > > (gdb) p malloc(72) > > > (gdb) p read(sockfd,$1,72) > > > (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise > > > surprise, there were more CAN frames in store. Refer to the log/pdf. > > > > I was finally able to reproduce the issue and capture a full SPI/CAN > > log when it happens. At this point it looks like the tail pointer of > > the FIFO in the MCP2518FD gets corrupted and the driver does what it's > > supposed to do and reads out lots of frames... > > You mean the head pointer? I've reproduced the issue here, too, on a single-core imx6, net-next/master from end of last year. I added some tracing to the driver and recoded the following events. Let me explain how to read this. The "chip" variants are read from the chip via SPI. The "ring" variants are the cached values in the driver. chip-tail chip-delta (head-tail) | chip-head | | | | ring-tail ring-head ring-delta | | | | | | (head-tail) | do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag ^^^^^^^^ These are the decoded lowest 3 bits of the FIFOSTA register. The different runs of the rxif handler (mcp251xfd_handle_rxif()) are separated by a ("----") line. Now the analysis: | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x09 ch=0x0a cd=0x01 rt=0x00000c99 rh=0x00000c9a rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0a ch=0x0b cd=0x01 rt=0x00000c9a rh=0x00000c9b rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag So far everything looks good. The chip delta ("cd"), i.e. the CAN frames waiting to be read is 1. The ring delta is 1, too. Meaning 1 CAN frame has been read. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0c ch=0x0f cd=0x03 rt=0x00000c9c rh=0x00000c9f rd=0x03 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag Now 3 CAN frame are waiting to be read. Thomas told me that he sees 3 CAN frames waiting too, before things go downhill. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x0e cd=0x0f rt=0x00000c9f rh=0x00000cae rd=0x0f | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag This is strange: chip head "jumped" to 0x0e, this means there are 0x0f CAN frames waiting. However the FIFO flags (FIFO Half Full set, FIFO Full not set) are consistent with 0x0f waiting CAN frames, the FIFO size if 0x10. Note: The chip tail "ct=0x0f" is consistent, as the "old" chip head was 0x0f. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0e cd=0x10 rt=0x00000cae rh=0x00000cbe rd=0x10 | TFERFFIF x Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag Chip tail is consistent (0x0e) as it equals the old chip head. chip head is also 0x0e, this means either an empty or a full FIFO, the flags indicate a full FIFO. The following events seem to be OK: | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0f cd=0x01 rt=0x00000cbe rh=0x00000cbf rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x00 cd=0x01 rt=0x00000cbf rh=0x00000cc0 rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- Final note on the number of send and received CAN frames. The ring head (in the event above) shows 0x0cc0 received CAN frames. The sequence number in the register dump shows 0x000c9f, which means a total of 0x000ca0 CAN frames have been send. | TX-0 Object: 0x03 (0xaf8) | id = 0x000002a5 | flags = 0x00193e08 | SEQ_MCP2517FD = 0x00001f Sequence (MCP2517) | SEQ_MCP2518FD = 0x000c9f Sequence (MCP2518) | data = 00 03 69 df c5 ee 62 77 The BDIAG1 register gives a total of 0x0ca0 error free messages, looks consistent, too Thomas is this RX'ed or TX'ed error free messages? | BDIAG1: bdiag1(0x03c)=0x00000ca0 [...] | EFMSGCNT = 0x0ca0 Error Free Message Counter regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Attachment:
signature.asc
Description: PGP signature