Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal

Ronald Wahl <ronald.wahl@xxxxxxxxxxx> · Wed, 29 Nov 2023 12:05:52 +0100

Hi Miquel,

On 29.11.23 09:49, Miquel Raynal wrote:
Hi Ronald,

+ Richard, my dear jffs2 expert ;)

ronald.wahl@xxxxxxxxxxx wrote on Mon, 27 Nov 2023 18:54:40 +0100:

On 27.11.23 16:10, Ronald Wahl wrote:
On 27.11.23 10:58, Miquel Raynal wrote:
The intended move from wait_for_completion_*() to
wait_for_completion_interruptible_*() was to allow (very) long spi memor
y
transfers to be stopped upon user request instead of freezing the
machine forever as the timeout value could now be significantly bigger.

However, depending on the user logic, applications can receive many
signals for their own "internal" purpose and have nothing to do with the
requested kernel operations, hence interrupting spi transfers upon any
signal is probably not a wise choice. Instead, let's switch to
wait_for_completion_killable_*() to only catch the "important"
signals. This was likely the intended behavior anyway.

Actually this seems to work. But aborting a process that has a SPI
transfer running causes ugly messages from kernel. This is somehow
unexpected:

# dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512
^C[  380.726760] spi-nor spi0.0: spi transfer canceled
[  380.731688] spi-nor spi0.0: SPI transfer failed: -512
[  380.737141] spi_master spi0: failed to transfer one message from queue
[  380.746495] spi-nor spi0.0: spi transfer canceled
[  380.751549] spi-nor spi0.0: SPI transfer failed: -512
[  380.756844] spi_master spi0: failed to transfer one message from queue

JFFS2 also logs an informational message which is less visible but also
may rise eyebrows:
[  380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu
rned
-512, retlen 68

Killing a process is something to expect in certain cases and it should
not cause such messages which may create some anxiety that something bad
had happened. So maybe the "kill" case should be silent (e.g. level
"debug")
but without out hiding real errors. But even when hiding the message in t
he
SPI framework it may cause additional messages in upper layers like JFFS2
.
I'm not sure whether all of this is a good idea. This is something others
have to decide.

... and now I just got a crash when unmounting and remounting jffs2:

unmount:
[ 8245.821105] spi-nor spi0.0: spi transfer canceled
[ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512
[ 8245.831508] spi_master spi0: failed to transfer one message from queue
[ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5
12, retlen 68
[ 8245.839786] spi-nor spi0.0: spi transfer canceled
[ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512
[ 8245.850145] spi_master spi0: failed to transfer one message from queue
[ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5
12, retlen 0
[ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the flash driver returned retlen zero

mount:
[ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b
d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578
-0x000ad578)
[ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2
[ 8831.234996] CPU: 0 PID: 1142 Comm: mount Not tainted 6.6.2-sama5 #1
[ 8831.241587] Hardware name: Atmel SAMA5
[ 8831.245478] PC is at jffs2_link_node_ref+0xe/0xe2
[ 8831.250360] LR is at jffs2_link_node_ref+0xb9/0xe2
[ 8831.255473] pc : [<c020969c>]    lr : [<c0209747>]    psr: 00000033
[ 8831.261893] sp : c974dd78  ip : 00000000  fp : c09c6a5c
[ 8831.267428] r10: c1572f18  r9 : 0000e002  r8 : c2849964
[ 8831.272801] r7 : 00001a44  r6 : 000ae5bc  r5 : c14df4a8  r4 : c1620208
[ 8831.279647] r3 : 00000001  r2 : 40000000  r1 : c090ce3c  r0 : 00000093
[ 8831.286340] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA Thumb  Segme
nt none
[ 8831.293974] Control: 50c53c7d  Table: 21470059  DAC: 00000051
[ 8831.300025] Register r0 information: non-paged memory
[ 8831.305239] Register r1 information: non-slab/vmalloc memory
[ 8831.311220] Register r2 information: non-paged memory
[ 8831.316427] Register r3 information: non-paged memory
[ 8831.321630] Register r4 information: slab kmalloc-2k start c1620000 poin
ter offset 520 size 2048
[ 8831.330796] Register r5 information: slab jffs2_refblock start c14df3e0 pointer offset 200 size 248
[ 8831.340219] Register r6 information: non-paged memory
[ 8831.345422] Register r7 information: non-paged memory
[ 8831.350784] Register r8 information: slab kmalloc-4k start c2849000 poin
ter offset 2404 size 4096
[ 8831.360032] Register r9 information: non-paged memory
[ 8831.365243] Register r10 information: slab pde_opener start c1572f18 poi
nter offset 0 size 24
[ 8831.374137] Register r11 information: non-slab/vmalloc memory
[ 8831.380041] Register r12 information: NULL pointer
[ 8831.385140] Process mount (pid: 1142, stack limit = 0xb9d2bb39)
[ 8831.391215] Stack: (0xc974dd78 to 0xc974e000)
[ 8831.395884] dd60:                                                       000ad578 000ae5bc
[ 8831.404242] dd80: 000ad578 000ad578 c2849dec c2805e00 c1620208 c2849dec 00000000 c021130d
[ 8831.412762] dda0: c1572d50 00000694 0000069c 00000036 00010000 c2937755 00000000 20061985
[ 8831.421280] ddc0: 0000069c 00000000 c2849964 c2805e00 c1620208 c2849964 0000069c 00000008
[ 8831.429799] dde0: 00000000 c2849ff8 c2849000 c020c40b c09c6a5c c974de34 c974de30 00000000
[ 8831.438154] de00: 00000004 0000069c 00001000 c2937bc0 00000208 c1620000 00000000 000a0000
[ 8831.446665] de20: 00000000 00000000 0000000a 00070000 c2849000 00000000 00001000 00000000
[ 8831.455184] de40: 00000000 00000000 00010000 c2805e00 00000000 00000000 00200000 c2937e00
[ 8831.463704] de60: 00000000 00008000 c0975a98 c020e02d c1001300 00000200 c020f3eb 00000dc0
[ 8831.472062] de80: c2937e00 00000000 c020f3eb c2805e00 c156fc80 c29b9400 00200000 c2937e00
[ 8831.480584] dea0: 00000000 c020f42d c29b9400 c156fc80 c09c6a64 c29b9400 c1298c00 c156fc80
[ 8831.489100] dec0: c020f5c9 c02a853d 0000003a 00000000 c156fc80 c020f5c9 c974df58 c02a85ff
[ 8831.497458] dee0: c156fc80 c020f5c9 00000000 c020f5bf c156fc80 c156fc80 00000020 c016fdc3
[ 8831.505976] df00: c156fc80 00000000 c1374780 c0115c75 00000000 00000000 c156fc80 c0182209
[ 8831.514497] df20: c2937e00 c974df58 c2937f80 00000000 c2937e00 c2937f80 00008000 00000000
[ 8831.523015] df40: 0006d71a 00008000 00068991 c0182307 00000000 c015048b c1360d90 c18b3a18
[ 8831.531370] df60: 00001000 c2937f80 c2937e00 00000000 00000015 c0182539 00000000 0006d71a
[ 8831.539881] df80: 0006d725 00000000 0006d70c 0006d71a 00000015 c010023c c1374780 00000015
[ 8831.548399] dfa0: 00068991 c0100041 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000
[ 8831.556908] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991
[ 8831.565263] dfe0: b6ea606c beb82b9c 00039818 b6ea607c a0000010 0006d70c 00000000 00000000
[ 8831.573779]  jffs2_link_node_ref from jffs2_sum_scan_sumnode+0x1a5/0x320
[ 8831.580860]  jffs2_sum_scan_sumnode from jffs2_scan_medium+0x2d7/0xab0
[ 8831.587595]  jffs2_scan_medium from jffs2_do_mount_fs+0xeb/0x356
[ 8831.593958]  jffs2_do_mount_fs from jffs2_do_fill_super+0xf7/0x182
[ 8831.600332]  jffs2_do_fill_super from mtd_get_sb+0x61/0x98
[ 8831.606176]  mtd_get_sb from get_tree_mtd+0x4f/0xe8
[ 8831.611241]  get_tree_mtd from vfs_get_tree+0x13/0x7c
[ 8831.616648]  vfs_get_tree from path_mount+0x409/0x4d4
[ 8831.621891]  path_mount from do_mount+0x33/0x40
[ 8831.626593]  do_mount from sys_mount+0xeb/0xfe
[ 8831.631367]  sys_mount from ret_fast_syscall+0x1/0x5c
[ 8831.636591] Exception stack(0xc974dfa8 to 0xc974dff0)
[ 8831.641961] dfa0:                   00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000
[ 8831.650314] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991
[ 8831.658818] dfe0: b6ea606c beb82b9c 00039818 b6ea607c
[ 8831.664195] Code: 6a63 b085 990a b903 (de02) 6ae5
[ 8831.669135] ---[ end trace 0000000000000000 ]---
[ 8831.673902] Kernel panic - not syncing: Fatal exception

The previous aborted I/Os from a dd may also play a role here but the crash
  is
clearly a cause of the interrupted transfers. It's a bit odd that it is an
undefined instruction but probably it was a stack overflow.

Looking back in the SPI driver history I see some occasions where interrupt
ible
transfers were changed to non-interruptible transfers because filesystems l
ike
jffs2 used SIGKILL during unmount or maybe other reasons:

775c4c0032c4 "spi: stm32-qspi: remove signal sensitive on completion"
26cfc0dbe43a "spi-zynq-qspi: use wait_for_completion_timeout to make zynq_q
spi_exec_mem_op not interruptible"
7f3ac71ac3b0 "spi: davinci: fix spurious i/o error"

Having them in spi-atmel now seems a bit off and even a source of misbehavi
our
and even crashes.

It's not just spi-atmel, any spi-mem controller might be tempted to use
interruptible^Wkillable transfers just because the timeout values can
be really big as the memory sizes increase.

One solution is to change the completion helpers back to something
non-killable/non-interruptible, but the user experience will be
slightly degraded. The other would be to look into jffs2 (if it's the
only filesystem playing with signals during unmount, tbh I don't know).
But maybe this signaling mechanism can't be hacked for compatibility
reasons. Handling this at the spi level would be a mix of layers, I'm
not ready for that.

Richard, Mark, what's your opinion here?

I think you will need to revert to uninterruptible transfers.
Just tested with the more modern UBIFS:

# dd if=/dev/random of=/flashdisk/testfile bs=1024 count=512
^C

[   74.281739] spi-nor spi1.0: spi transfer canceled
[   74.286768] spi-nor spi1.0: SPI transfer failed: -512
[   74.292090] spi_master spi1: failed to transfer one message from queue
[   74.299002] ubi0 error: ubi_io_write: error -512 while writing 48 bytes to PEB 38:4224, written 0 bytes
[   74.308884] CPU: 0 PID: 1129 Comm: dd Not tainted 6.6.2-sama5 #2
[   74.315070] Hardware name: Atmel SAMA5
[   74.319159]  unwind_backtrace from show_stack+0x11/0x12
[   74.324597]  show_stack from dump_stack_lvl+0x25/0x3e
[   74.330020]  dump_stack_lvl from ubi_io_write+0x1a7/0x220
[   74.335616]  ubi_io_write from ubi_eba_write_leb+0x91/0x1dc
[   74.341554]  ubi_eba_write_leb from ubi_leb_write+0x99/0xbc
[   74.347311]  ubi_leb_write from ubifs_leb_write+0x4b/0x8c
[   74.352923]  ubifs_leb_write from ubifs_wbuf_sync_nolock+0x141/0x1d4
[   74.359649]  ubifs_wbuf_sync_nolock from ubifs_sync_wbufs_by_inode+0x51/0x6a
[   74.367061]  ubifs_sync_wbufs_by_inode from ubifs_fsync+0x6f/0xa0
[   74.373340]  ubifs_fsync from generic_file_write_iter+0x8f/0x98
[   74.379632]  generic_file_write_iter from vfs_write+0xbd/0x100
[   74.385659]  vfs_write from ksys_write+0x45/0x68
[   74.390636]  ksys_write from ret_fast_syscall+0x1/0x16
[   74.395958] Exception stack(0xd183dfa8 to 0xd183dff0)
[   74.401352] dfa0:                   00000001 01a97008 00000001 01a97008 00000400 00000004
[   74.409709] dfc0: 00000001 01a97008 bef2abd8 00000004 00000000 0005b94c 00052ac9 00000000
[   74.418234] dfe0: 0005b1f0 bef2abd8 00015ecf b6f0e6f6
[   74.482699] spi-nor spi1.0: spi transfer canceled
[   74.487825] spi-nor spi1.0: SPI transfer failed: -512
[   74.493205] spi_master spi1: failed to transfer one message from queue
[   74.500095] ubi0 error: ubi_dump_flash: err -512 while reading 48 bytes from PEB 38:4224, read 0 bytes
[   74.521967] ubi0 warning: ubi_eba_write_leb: failed to write data to PEB 38
[   74.529254] ubi0 warning: ubi_ro_mode: switch to read-only mode

I think file systems need uninterruptible transfers to ensure that the filesystem
structure stays consistent. With interruptible transfers this is just not possible.

Maybe a device tree option could be added to enable interruptible behaviour for
people that really need it but it would be unsafe at least for filesystems. I'm
not sure if this is something which is acceptible by the kernel people. I doubt.

BTW, the "Internal error: Oops - undefined instruction: 0" in the JFFS2 crash
was likely a BUG macro. It uses an illegal instruction but the trap handler
should see that it is from a BUG. It is obviously broken on ARM32.

- ron

________________________________

Ce message, ainsi que tous les fichiers joints à ce message, peuvent contenir des informations sensibles et/ ou confidentielles ne devant pas être divulguées. Si vous n'êtes pas le destinataire de ce message (ou que vous recevez ce message par erreur), nous vous remercions de le notifier immédiatement à son expéditeur, et de détruire ce message. Toute copie, divulgation, modification, utilisation ou diffusion, non autorisée, directe ou indirecte, de tout ou partie de ce message, est strictement interdite.

This e-mail, and any document attached hereby, may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized, direct or indirect, copying, disclosure, distribution or other use of the material or parts thereof is strictly forbidden.