Hi Richard, richard@xxxxxx wrote on Thu, 30 Nov 2023 19:26:07 +0100 (CET): > ----- Ursprüngliche Mail ----- > > Von: "richard" <richard@xxxxxx> > > An: "Miquel Raynal" <miquel.raynal@xxxxxxxxxxx> > > CC: "Ronald Wahl" <ronald.wahl@xxxxxxxxxxx>, "Mark Brown" <broonie@xxxxxxxxxx>, "linux-spi" <linux-spi@xxxxxxxxxxxxxxx>, > > "Thomas Petazzoni" <thomas.petazzoni@xxxxxxxxxxx>, "Ryan Wanner" <ryan.wanner@xxxxxxxxxxxxx>, "stable" > > <stable@xxxxxxxxxxxxxxx>, "Richard Weinberger" <richard.weinberger@xxxxxxxxx> > > Gesendet: Donnerstag, 30. November 2023 13:46:14 > > Betreff: Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal > > > ----- Ursprüngliche Mail ----- > >> Von: "Miquel Raynal" <miquel.raynal@xxxxxxxxxxx> > >> + Richard, my dear jffs2 expert ;) > > > > :-S > > > >> > >> ronald.wahl@xxxxxxxxxxx wrote on Mon, 27 Nov 2023 18:54:40 +0100: > >> > >>> On 27.11.23 16:10, Ronald Wahl wrote: > >>> > On 27.11.23 10:58, Miquel Raynal wrote: > >>> >> The intended move from wait_for_completion_*() to > >>> >> wait_for_completion_interruptible_*() was to allow (very) long spi memor > >>> y > >>> >> transfers to be stopped upon user request instead of freezing the > >>> >> machine forever as the timeout value could now be significantly bigger. > >>> >> > >>> >> However, depending on the user logic, applications can receive many > >>> >> signals for their own "internal" purpose and have nothing to do with the > >>> >> requested kernel operations, hence interrupting spi transfers upon any > >>> >> signal is probably not a wise choice. Instead, let's switch to > >>> >> wait_for_completion_killable_*() to only catch the "important" > >>> >> signals. This was likely the intended behavior anyway. > >>> > > >>> > Actually this seems to work. But aborting a process that has a SPI > >>> > transfer running causes ugly messages from kernel. This is somehow > >>> > unexpected: > >>> > > >>> > # dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512 > >>> > ^C[ 380.726760] spi-nor spi0.0: spi transfer canceled > >>> > [ 380.731688] spi-nor spi0.0: SPI transfer failed: -512 > >>> > [ 380.737141] spi_master spi0: failed to transfer one message from queue > >>> > [ 380.746495] spi-nor spi0.0: spi transfer canceled > >>> > [ 380.751549] spi-nor spi0.0: SPI transfer failed: -512 > >>> > [ 380.756844] spi_master spi0: failed to transfer one message from queue > >>> > > >>> > JFFS2 also logs an informational message which is less visible but also > >>> > may rise eyebrows: > >>> > [ 380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu > >>> rned > >>> > -512, retlen 68 > > > > Ugly kernel messages are a normal consequence of killing an IO. > > Chances are good that we'll find bugs in the upper layers. > > > >>> > Killing a process is something to expect in certain cases and it should > >>> > not cause such messages which may create some anxiety that something bad > >>> > had happened. So maybe the "kill" case should be silent (e.g. level > >>> > "debug") > >>> > but without out hiding real errors. But even when hiding the message in t > >>> he > >>> > SPI framework it may cause additional messages in upper layers like JFFS2 > >>> . > >>> > I'm not sure whether all of this is a good idea. This is something others > >>> > have to decide. > >>> > >>> ... and now I just got a crash when unmounting and remounting jffs2: > >>> > >>> unmount: > >>> [ 8245.821105] spi-nor spi0.0: spi transfer canceled > >>> [ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512 > >>> [ 8245.831508] spi_master spi0: failed to transfer one message from queue > >>> [ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5 > >>> 12, retlen 68 > >>> [ 8245.839786] spi-nor spi0.0: spi transfer canceled > >>> [ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512 > >>> [ 8245.850145] spi_master spi0: failed to transfer one message from queue > >>> [ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5 > >>> 12, retlen 0 > >>> [ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the > >>> flash driver returned retlen zero > > > > jffs2 has a garbage collect thread which can be controlled using various > > signals. > > To terminate the thread, jffs2 sends SIGKILL upon umount. > > If the gc thread does IO while that, you gonna kill the IO too. > > > >>> mount: > >>> [ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b > >>> d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578 > >>> -0x000ad578) > >>> [ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2 > > > > > > I fear this is a jffs2 (summary feature) bug. Chances are great that you're able > > to trigger the very same using a sudden loss of power. > > > >> It's not just spi-atmel, any spi-mem controller might be tempted to use > >> interruptible^Wkillable transfers just because the timeout values can > >> be really big as the memory sizes increase. > >> > >> One solution is to change the completion helpers back to something > >> non-killable/non-interruptible, but the user experience will be > >> slightly degraded. The other would be to look into jffs2 (if it's the > >> only filesystem playing with signals during unmount, tbh I don't know). > >> But maybe this signaling mechanism can't be hacked for compatibility > >> reasons. Handling this at the spi level would be a mix of layers, I'm > >> not ready for that. > >> > >> Richard, Mark, what's your opinion here? > > > > I *think* we can remove the signal handling code from jffs2 since it makes > > already use of the kthread_should_stop() API. > > That way we can keep the SPI transfer interruptible by signals. > > ...reading right now into the history to figure better. > > After a brief discussion with dwmw2 another question came up, if an spi transfer > is cancelled, *all* other IO do the filesystem has to stop too. > IO can happen concurrently, if only one IO path dies but the other ones can > make progress, the filesystem becomes inconsistent and all hope is lost. > > Miquel, is this guaranteed by your changes? Absolutely not, the changes are in a spi controller, there is nothing specific to the user there. If a filesystem transfer get interrupted, it's the filesystem responsibility to cancel the other IOs if that's relevant for its own consistency? Thanks, Miquèl