Hi Ronald, + Richard, my dear jffs2 expert ;) ronald.wahl@xxxxxxxxxxx wrote on Mon, 27 Nov 2023 18:54:40 +0100: > On 27.11.23 16:10, Ronald Wahl wrote: > > On 27.11.23 10:58, Miquel Raynal wrote: > >> The intended move from wait_for_completion_*() to > >> wait_for_completion_interruptible_*() was to allow (very) long spi memor > y > >> transfers to be stopped upon user request instead of freezing the > >> machine forever as the timeout value could now be significantly bigger. > >> > >> However, depending on the user logic, applications can receive many > >> signals for their own "internal" purpose and have nothing to do with the > >> requested kernel operations, hence interrupting spi transfers upon any > >> signal is probably not a wise choice. Instead, let's switch to > >> wait_for_completion_killable_*() to only catch the "important" > >> signals. This was likely the intended behavior anyway. > > > > Actually this seems to work. But aborting a process that has a SPI > > transfer running causes ugly messages from kernel. This is somehow > > unexpected: > > > > # dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512 > > ^C[ 380.726760] spi-nor spi0.0: spi transfer canceled > > [ 380.731688] spi-nor spi0.0: SPI transfer failed: -512 > > [ 380.737141] spi_master spi0: failed to transfer one message from queue > > [ 380.746495] spi-nor spi0.0: spi transfer canceled > > [ 380.751549] spi-nor spi0.0: SPI transfer failed: -512 > > [ 380.756844] spi_master spi0: failed to transfer one message from queue > > > > JFFS2 also logs an informational message which is less visible but also > > may rise eyebrows: > > [ 380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu > rned > > -512, retlen 68 > > > > Killing a process is something to expect in certain cases and it should > > not cause such messages which may create some anxiety that something bad > > had happened. So maybe the "kill" case should be silent (e.g. level > > "debug") > > but without out hiding real errors. But even when hiding the message in t > he > > SPI framework it may cause additional messages in upper layers like JFFS2 > . > > I'm not sure whether all of this is a good idea. This is something others > > have to decide. > > ... and now I just got a crash when unmounting and remounting jffs2: > > unmount: > [ 8245.821105] spi-nor spi0.0: spi transfer canceled > [ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512 > [ 8245.831508] spi_master spi0: failed to transfer one message from queue > [ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5 > 12, retlen 68 > [ 8245.839786] spi-nor spi0.0: spi transfer canceled > [ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512 > [ 8245.850145] spi_master spi0: failed to transfer one message from queue > [ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5 > 12, retlen 0 > [ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the flash driver returned retlen zero > > mount: > [ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b > d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578 > -0x000ad578) > [ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2 > [ 8831.234996] CPU: 0 PID: 1142 Comm: mount Not tainted 6.6.2-sama5 #1 > [ 8831.241587] Hardware name: Atmel SAMA5 > [ 8831.245478] PC is at jffs2_link_node_ref+0xe/0xe2 > [ 8831.250360] LR is at jffs2_link_node_ref+0xb9/0xe2 > [ 8831.255473] pc : [<c020969c>] lr : [<c0209747>] psr: 00000033 > [ 8831.261893] sp : c974dd78 ip : 00000000 fp : c09c6a5c > [ 8831.267428] r10: c1572f18 r9 : 0000e002 r8 : c2849964 > [ 8831.272801] r7 : 00001a44 r6 : 000ae5bc r5 : c14df4a8 r4 : c1620208 > [ 8831.279647] r3 : 00000001 r2 : 40000000 r1 : c090ce3c r0 : 00000093 > [ 8831.286340] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA Thumb Segme > nt none > [ 8831.293974] Control: 50c53c7d Table: 21470059 DAC: 00000051 > [ 8831.300025] Register r0 information: non-paged memory > [ 8831.305239] Register r1 information: non-slab/vmalloc memory > [ 8831.311220] Register r2 information: non-paged memory > [ 8831.316427] Register r3 information: non-paged memory > [ 8831.321630] Register r4 information: slab kmalloc-2k start c1620000 poin > ter offset 520 size 2048 > [ 8831.330796] Register r5 information: slab jffs2_refblock start c14df3e0 pointer offset 200 size 248 > [ 8831.340219] Register r6 information: non-paged memory > [ 8831.345422] Register r7 information: non-paged memory > [ 8831.350784] Register r8 information: slab kmalloc-4k start c2849000 poin > ter offset 2404 size 4096 > [ 8831.360032] Register r9 information: non-paged memory > [ 8831.365243] Register r10 information: slab pde_opener start c1572f18 poi > nter offset 0 size 24 > [ 8831.374137] Register r11 information: non-slab/vmalloc memory > [ 8831.380041] Register r12 information: NULL pointer > [ 8831.385140] Process mount (pid: 1142, stack limit = 0xb9d2bb39) > [ 8831.391215] Stack: (0xc974dd78 to 0xc974e000) > [ 8831.395884] dd60: 000ad578 000ae5bc > [ 8831.404242] dd80: 000ad578 000ad578 c2849dec c2805e00 c1620208 c2849dec 00000000 c021130d > [ 8831.412762] dda0: c1572d50 00000694 0000069c 00000036 00010000 c2937755 00000000 20061985 > [ 8831.421280] ddc0: 0000069c 00000000 c2849964 c2805e00 c1620208 c2849964 0000069c 00000008 > [ 8831.429799] dde0: 00000000 c2849ff8 c2849000 c020c40b c09c6a5c c974de34 c974de30 00000000 > [ 8831.438154] de00: 00000004 0000069c 00001000 c2937bc0 00000208 c1620000 00000000 000a0000 > [ 8831.446665] de20: 00000000 00000000 0000000a 00070000 c2849000 00000000 00001000 00000000 > [ 8831.455184] de40: 00000000 00000000 00010000 c2805e00 00000000 00000000 00200000 c2937e00 > [ 8831.463704] de60: 00000000 00008000 c0975a98 c020e02d c1001300 00000200 c020f3eb 00000dc0 > [ 8831.472062] de80: c2937e00 00000000 c020f3eb c2805e00 c156fc80 c29b9400 00200000 c2937e00 > [ 8831.480584] dea0: 00000000 c020f42d c29b9400 c156fc80 c09c6a64 c29b9400 c1298c00 c156fc80 > [ 8831.489100] dec0: c020f5c9 c02a853d 0000003a 00000000 c156fc80 c020f5c9 c974df58 c02a85ff > [ 8831.497458] dee0: c156fc80 c020f5c9 00000000 c020f5bf c156fc80 c156fc80 00000020 c016fdc3 > [ 8831.505976] df00: c156fc80 00000000 c1374780 c0115c75 00000000 00000000 c156fc80 c0182209 > [ 8831.514497] df20: c2937e00 c974df58 c2937f80 00000000 c2937e00 c2937f80 00008000 00000000 > [ 8831.523015] df40: 0006d71a 00008000 00068991 c0182307 00000000 c015048b c1360d90 c18b3a18 > [ 8831.531370] df60: 00001000 c2937f80 c2937e00 00000000 00000015 c0182539 00000000 0006d71a > [ 8831.539881] df80: 0006d725 00000000 0006d70c 0006d71a 00000015 c010023c c1374780 00000015 > [ 8831.548399] dfa0: 00068991 c0100041 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000 > [ 8831.556908] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991 > [ 8831.565263] dfe0: b6ea606c beb82b9c 00039818 b6ea607c a0000010 0006d70c 00000000 00000000 > [ 8831.573779] jffs2_link_node_ref from jffs2_sum_scan_sumnode+0x1a5/0x320 > [ 8831.580860] jffs2_sum_scan_sumnode from jffs2_scan_medium+0x2d7/0xab0 > [ 8831.587595] jffs2_scan_medium from jffs2_do_mount_fs+0xeb/0x356 > [ 8831.593958] jffs2_do_mount_fs from jffs2_do_fill_super+0xf7/0x182 > [ 8831.600332] jffs2_do_fill_super from mtd_get_sb+0x61/0x98 > [ 8831.606176] mtd_get_sb from get_tree_mtd+0x4f/0xe8 > [ 8831.611241] get_tree_mtd from vfs_get_tree+0x13/0x7c > [ 8831.616648] vfs_get_tree from path_mount+0x409/0x4d4 > [ 8831.621891] path_mount from do_mount+0x33/0x40 > [ 8831.626593] do_mount from sys_mount+0xeb/0xfe > [ 8831.631367] sys_mount from ret_fast_syscall+0x1/0x5c > [ 8831.636591] Exception stack(0xc974dfa8 to 0xc974dff0) > [ 8831.641961] dfa0: 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000 > [ 8831.650314] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991 > [ 8831.658818] dfe0: b6ea606c beb82b9c 00039818 b6ea607c > [ 8831.664195] Code: 6a63 b085 990a b903 (de02) 6ae5 > [ 8831.669135] ---[ end trace 0000000000000000 ]--- > [ 8831.673902] Kernel panic - not syncing: Fatal exception > > The previous aborted I/Os from a dd may also play a role here but the crash > is > clearly a cause of the interrupted transfers. It's a bit odd that it is an > undefined instruction but probably it was a stack overflow. > > Looking back in the SPI driver history I see some occasions where interrupt > ible > transfers were changed to non-interruptible transfers because filesystems l > ike > jffs2 used SIGKILL during unmount or maybe other reasons: > > 775c4c0032c4 "spi: stm32-qspi: remove signal sensitive on completion" > 26cfc0dbe43a "spi-zynq-qspi: use wait_for_completion_timeout to make zynq_q > spi_exec_mem_op not interruptible" > 7f3ac71ac3b0 "spi: davinci: fix spurious i/o error" > > Having them in spi-atmel now seems a bit off and even a source of misbehavi > our > and even crashes. It's not just spi-atmel, any spi-mem controller might be tempted to use interruptible^Wkillable transfers just because the timeout values can be really big as the memory sizes increase. One solution is to change the completion helpers back to something non-killable/non-interruptible, but the user experience will be slightly degraded. The other would be to look into jffs2 (if it's the only filesystem playing with signals during unmount, tbh I don't know). But maybe this signaling mechanism can't be hacked for compatibility reasons. Handling this at the spi level would be a mix of layers, I'm not ready for that. Richard, Mark, what's your opinion here? Thanks, Miquèl