Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal

Miquel Raynal <miquel.raynal@xxxxxxxxxxx> · Wed, 29 Nov 2023 09:49:32 +0100

Hi Ronald,

+ Richard, my dear jffs2 expert ;)

ronald.wahl@xxxxxxxxxxx wrote on Mon, 27 Nov 2023 18:54:40 +0100:

> On 27.11.23 16:10, Ronald Wahl wrote:
> > On 27.11.23 10:58, Miquel Raynal wrote:  
> >> The intended move from wait_for_completion_*() to
> >> wait_for_completion_interruptible_*() was to allow (very) long spi memor  
> y
> >> transfers to be stopped upon user request instead of freezing the
> >> machine forever as the timeout value could now be significantly bigger.
> >>
> >> However, depending on the user logic, applications can receive many
> >> signals for their own "internal" purpose and have nothing to do with the
> >> requested kernel operations, hence interrupting spi transfers upon any
> >> signal is probably not a wise choice. Instead, let's switch to
> >> wait_for_completion_killable_*() to only catch the "important"
> >> signals. This was likely the intended behavior anyway.  
> >
> > Actually this seems to work. But aborting a process that has a SPI
> > transfer running causes ugly messages from kernel. This is somehow
> > unexpected:
> >
> > # dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512
> > ^C[  380.726760] spi-nor spi0.0: spi transfer canceled
> > [  380.731688] spi-nor spi0.0: SPI transfer failed: -512
> > [  380.737141] spi_master spi0: failed to transfer one message from queue
> > [  380.746495] spi-nor spi0.0: spi transfer canceled
> > [  380.751549] spi-nor spi0.0: SPI transfer failed: -512
> > [  380.756844] spi_master spi0: failed to transfer one message from queue
> >
> > JFFS2 also logs an informational message which is less visible but also
> > may rise eyebrows:
> > [  380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu  
> rned
> > -512, retlen 68
> >
> > Killing a process is something to expect in certain cases and it should
> > not cause such messages which may create some anxiety that something bad
> > had happened. So maybe the "kill" case should be silent (e.g. level
> > "debug")
> > but without out hiding real errors. But even when hiding the message in t  
> he
> > SPI framework it may cause additional messages in upper layers like JFFS2  
> .
> > I'm not sure whether all of this is a good idea. This is something others
> > have to decide.  
> 
> ... and now I just got a crash when unmounting and remounting jffs2:
> 
> unmount:
> [ 8245.821105] spi-nor spi0.0: spi transfer canceled
> [ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512
> [ 8245.831508] spi_master spi0: failed to transfer one message from queue
> [ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5
> 12, retlen 68
> [ 8245.839786] spi-nor spi0.0: spi transfer canceled
> [ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512
> [ 8245.850145] spi_master spi0: failed to transfer one message from queue
> [ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5
> 12, retlen 0
> [ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the flash driver returned retlen zero
> 
> mount:
> [ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b
> d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578
> -0x000ad578)
> [ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2
> [ 8831.234996] CPU: 0 PID: 1142 Comm: mount Not tainted 6.6.2-sama5 #1
> [ 8831.241587] Hardware name: Atmel SAMA5
> [ 8831.245478] PC is at jffs2_link_node_ref+0xe/0xe2
> [ 8831.250360] LR is at jffs2_link_node_ref+0xb9/0xe2
> [ 8831.255473] pc : [<c020969c>]    lr : [<c0209747>]    psr: 00000033
> [ 8831.261893] sp : c974dd78  ip : 00000000  fp : c09c6a5c
> [ 8831.267428] r10: c1572f18  r9 : 0000e002  r8 : c2849964
> [ 8831.272801] r7 : 00001a44  r6 : 000ae5bc  r5 : c14df4a8  r4 : c1620208
> [ 8831.279647] r3 : 00000001  r2 : 40000000  r1 : c090ce3c  r0 : 00000093
> [ 8831.286340] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA Thumb  Segme
> nt none
> [ 8831.293974] Control: 50c53c7d  Table: 21470059  DAC: 00000051
> [ 8831.300025] Register r0 information: non-paged memory
> [ 8831.305239] Register r1 information: non-slab/vmalloc memory
> [ 8831.311220] Register r2 information: non-paged memory
> [ 8831.316427] Register r3 information: non-paged memory
> [ 8831.321630] Register r4 information: slab kmalloc-2k start c1620000 poin
> ter offset 520 size 2048
> [ 8831.330796] Register r5 information: slab jffs2_refblock start c14df3e0 pointer offset 200 size 248
> [ 8831.340219] Register r6 information: non-paged memory
> [ 8831.345422] Register r7 information: non-paged memory
> [ 8831.350784] Register r8 information: slab kmalloc-4k start c2849000 poin
> ter offset 2404 size 4096
> [ 8831.360032] Register r9 information: non-paged memory
> [ 8831.365243] Register r10 information: slab pde_opener start c1572f18 poi
> nter offset 0 size 24
> [ 8831.374137] Register r11 information: non-slab/vmalloc memory
> [ 8831.380041] Register r12 information: NULL pointer
> [ 8831.385140] Process mount (pid: 1142, stack limit = 0xb9d2bb39)
> [ 8831.391215] Stack: (0xc974dd78 to 0xc974e000)
> [ 8831.395884] dd60:                                                       000ad578 000ae5bc
> [ 8831.404242] dd80: 000ad578 000ad578 c2849dec c2805e00 c1620208 c2849dec 00000000 c021130d
> [ 8831.412762] dda0: c1572d50 00000694 0000069c 00000036 00010000 c2937755 00000000 20061985
> [ 8831.421280] ddc0: 0000069c 00000000 c2849964 c2805e00 c1620208 c2849964 0000069c 00000008
> [ 8831.429799] dde0: 00000000 c2849ff8 c2849000 c020c40b c09c6a5c c974de34 c974de30 00000000
> [ 8831.438154] de00: 00000004 0000069c 00001000 c2937bc0 00000208 c1620000 00000000 000a0000
> [ 8831.446665] de20: 00000000 00000000 0000000a 00070000 c2849000 00000000 00001000 00000000
> [ 8831.455184] de40: 00000000 00000000 00010000 c2805e00 00000000 00000000 00200000 c2937e00
> [ 8831.463704] de60: 00000000 00008000 c0975a98 c020e02d c1001300 00000200 c020f3eb 00000dc0
> [ 8831.472062] de80: c2937e00 00000000 c020f3eb c2805e00 c156fc80 c29b9400 00200000 c2937e00
> [ 8831.480584] dea0: 00000000 c020f42d c29b9400 c156fc80 c09c6a64 c29b9400 c1298c00 c156fc80
> [ 8831.489100] dec0: c020f5c9 c02a853d 0000003a 00000000 c156fc80 c020f5c9 c974df58 c02a85ff
> [ 8831.497458] dee0: c156fc80 c020f5c9 00000000 c020f5bf c156fc80 c156fc80 00000020 c016fdc3
> [ 8831.505976] df00: c156fc80 00000000 c1374780 c0115c75 00000000 00000000 c156fc80 c0182209
> [ 8831.514497] df20: c2937e00 c974df58 c2937f80 00000000 c2937e00 c2937f80 00008000 00000000
> [ 8831.523015] df40: 0006d71a 00008000 00068991 c0182307 00000000 c015048b c1360d90 c18b3a18
> [ 8831.531370] df60: 00001000 c2937f80 c2937e00 00000000 00000015 c0182539 00000000 0006d71a
> [ 8831.539881] df80: 0006d725 00000000 0006d70c 0006d71a 00000015 c010023c c1374780 00000015
> [ 8831.548399] dfa0: 00068991 c0100041 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000
> [ 8831.556908] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991
> [ 8831.565263] dfe0: b6ea606c beb82b9c 00039818 b6ea607c a0000010 0006d70c 00000000 00000000
> [ 8831.573779]  jffs2_link_node_ref from jffs2_sum_scan_sumnode+0x1a5/0x320
> [ 8831.580860]  jffs2_sum_scan_sumnode from jffs2_scan_medium+0x2d7/0xab0
> [ 8831.587595]  jffs2_scan_medium from jffs2_do_mount_fs+0xeb/0x356
> [ 8831.593958]  jffs2_do_mount_fs from jffs2_do_fill_super+0xf7/0x182
> [ 8831.600332]  jffs2_do_fill_super from mtd_get_sb+0x61/0x98
> [ 8831.606176]  mtd_get_sb from get_tree_mtd+0x4f/0xe8
> [ 8831.611241]  get_tree_mtd from vfs_get_tree+0x13/0x7c
> [ 8831.616648]  vfs_get_tree from path_mount+0x409/0x4d4
> [ 8831.621891]  path_mount from do_mount+0x33/0x40
> [ 8831.626593]  do_mount from sys_mount+0xeb/0xfe
> [ 8831.631367]  sys_mount from ret_fast_syscall+0x1/0x5c
> [ 8831.636591] Exception stack(0xc974dfa8 to 0xc974dff0)
> [ 8831.641961] dfa0:                   00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000
> [ 8831.650314] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991
> [ 8831.658818] dfe0: b6ea606c beb82b9c 00039818 b6ea607c
> [ 8831.664195] Code: 6a63 b085 990a b903 (de02) 6ae5
> [ 8831.669135] ---[ end trace 0000000000000000 ]---
> [ 8831.673902] Kernel panic - not syncing: Fatal exception
> 
> The previous aborted I/Os from a dd may also play a role here but the crash
>  is
> clearly a cause of the interrupted transfers. It's a bit odd that it is an
> undefined instruction but probably it was a stack overflow.
> 
> Looking back in the SPI driver history I see some occasions where interrupt
> ible
> transfers were changed to non-interruptible transfers because filesystems l
> ike
> jffs2 used SIGKILL during unmount or maybe other reasons:
> 
> 775c4c0032c4 "spi: stm32-qspi: remove signal sensitive on completion"
> 26cfc0dbe43a "spi-zynq-qspi: use wait_for_completion_timeout to make zynq_q
> spi_exec_mem_op not interruptible"
> 7f3ac71ac3b0 "spi: davinci: fix spurious i/o error"
> 
> Having them in spi-atmel now seems a bit off and even a source of misbehavi
> our
> and even crashes.

It's not just spi-atmel, any spi-mem controller might be tempted to use
interruptible^Wkillable transfers just because the timeout values can
be really big as the memory sizes increase.

One solution is to change the completion helpers back to something
non-killable/non-interruptible, but the user experience will be
slightly degraded. The other would be to look into jffs2 (if it's the
only filesystem playing with signals during unmount, tbh I don't know).
But maybe this signaling mechanism can't be hacked for compatibility
reasons. Handling this at the spi level would be a mix of layers, I'm
not ready for that.

Richard, Mark, what's your opinion here?

Thanks,
Miquèl