Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal

Ronald Wahl <ronald.wahl@xxxxxxxxxxx> · Thu, 30 Nov 2023 21:43:32 +0100

On 30.11.23 21:15, Miquel Raynal wrote:
Hi Richard,

richard@xxxxxx wrote on Thu, 30 Nov 2023 19:26:07 +0100 (CET):

----- Ursprüngliche Mail -----
Von: "richard" <richard@xxxxxx>
An: "Miquel Raynal" <miquel.raynal@xxxxxxxxxxx>
CC: "Ronald Wahl" <ronald.wahl@xxxxxxxxxxx>, "Mark Brown" <broonie@xxxxxxxxxx>, "linux-spi" <linux-spi@xxxxxxxxxxxxxxx>,
"Thomas Petazzoni" <thomas.petazzoni@xxxxxxxxxxx>, "Ryan Wanner" <ryan.wanner@xxxxxxxxxxxxx>, "stable"
<stable@xxxxxxxxxxxxxxx>, "Richard Weinberger" <richard.weinberger@xxxxxxxxx>
Gesendet: Donnerstag, 30. November 2023 13:46:14
Betreff: Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal

----- Ursprüngliche Mail -----
Von: "Miquel Raynal" <miquel.raynal@xxxxxxxxxxx>
+ Richard, my dear jffs2 expert ;)

:-S

ronald.wahl@xxxxxxxxxxx wrote on Mon, 27 Nov 2023 18:54:40 +0100:

On 27.11.23 16:10, Ronald Wahl wrote:
On 27.11.23 10:58, Miquel Raynal wrote:
The intended move from wait_for_completion_*() to
wait_for_completion_interruptible_*() was to allow (very) long spi memor
y
transfers to be stopped upon user request instead of freezing the
machine forever as the timeout value could now be significantly bigger.

However, depending on the user logic, applications can receive many
signals for their own "internal" purpose and have nothing to do with the
requested kernel operations, hence interrupting spi transfers upon any
signal is probably not a wise choice. Instead, let's switch to
wait_for_completion_killable_*() to only catch the "important"
signals. This was likely the intended behavior anyway.

Actually this seems to work. But aborting a process that has a SPI
transfer running causes ugly messages from kernel. This is somehow
unexpected:

# dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512
^C[  380.726760] spi-nor spi0.0: spi transfer canceled
[  380.731688] spi-nor spi0.0: SPI transfer failed: -512
[  380.737141] spi_master spi0: failed to transfer one message from queue
[  380.746495] spi-nor spi0.0: spi transfer canceled
[  380.751549] spi-nor spi0.0: SPI transfer failed: -512
[  380.756844] spi_master spi0: failed to transfer one message from queue

JFFS2 also logs an informational message which is less visible but also
may rise eyebrows:
[  380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu
rned
-512, retlen 68

Ugly kernel messages are a normal consequence of killing an IO.
Chances are good that we'll find bugs in the upper layers.

Killing a process is something to expect in certain cases and it should
not cause such messages which may create some anxiety that something bad
had happened. So maybe the "kill" case should be silent (e.g. level
"debug")
but without out hiding real errors. But even when hiding the message in t
he
SPI framework it may cause additional messages in upper layers like JFFS2
.
I'm not sure whether all of this is a good idea. This is something others
have to decide.

... and now I just got a crash when unmounting and remounting jffs2:

unmount:
[ 8245.821105] spi-nor spi0.0: spi transfer canceled
[ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512
[ 8245.831508] spi_master spi0: failed to transfer one message from queue
[ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5
12, retlen 68
[ 8245.839786] spi-nor spi0.0: spi transfer canceled
[ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512
[ 8245.850145] spi_master spi0: failed to transfer one message from queue
[ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5
12, retlen 0
[ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the
flash driver returned retlen zero

jffs2 has a garbage collect thread which can be controlled using various
signals.
To terminate the thread, jffs2 sends SIGKILL upon umount.
If the gc thread does IO while that, you gonna kill the IO too.

mount:
[ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b
d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578
-0x000ad578)
[ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2

I fear this is a jffs2 (summary feature) bug. Chances are great that you're able
to trigger the very same using a sudden loss of power.

It's not just spi-atmel, any spi-mem controller might be tempted to use
interruptible^Wkillable transfers just because the timeout values can
be really big as the memory sizes increase.

One solution is to change the completion helpers back to something
non-killable/non-interruptible, but the user experience will be
slightly degraded. The other would be to look into jffs2 (if it's the
only filesystem playing with signals during unmount, tbh I don't know).
But maybe this signaling mechanism can't be hacked for compatibility
reasons. Handling this at the spi level would be a mix of layers, I'm
not ready for that.

Richard, Mark, what's your opinion here?

I *think* we can remove the signal handling code from jffs2 since it makes
already use of the kthread_should_stop() API.
That way we can keep the SPI transfer interruptible by signals.
...reading right now into the history to figure better.

After a brief discussion with dwmw2 another question came up, if an spi transfer
is cancelled, *all* other IO do the filesystem has to stop too.
IO can happen concurrently, if only one IO path dies but the other ones can
make progress, the filesystem becomes inconsistent and all hope is lost.

Miquel, is this guaranteed by your changes?

Absolutely not, the changes are in a spi controller, there is nothing
specific to the user there. If a filesystem transfer get interrupted,
it's the filesystem responsibility to cancel the other IOs if that's
relevant for its own consistency?

I think yes. But the only thing the FS can do is stop any writes from now
on which is not a useful consequence of killing a process.

Anyway, the whole issue started with commit e0205d6203c2 "spi: atmel:
Prevent
false timeouts on long transfers". Citing the commit message here:
    "spi: atmel: Prevent false timeouts on long transfers

    A slow SPI bus clocks at ~20MHz, which means it would transfer about
    2500 bytes per second with a single data line. Big transfers, like when
    dealing with flashes can easily reach a few MiB. The current DMA
timeout
    is set to 1 second, which means any working transfer of about 4MiB will
    always be cancelled.

    With the above derivations, on a slow bus, we can assume every byte
will
    take at most 0.4ms. Said otherwise, we could add 4ms to the 1-second
    timeout delay every 10kiB. On a 4MiB transfer, it would bring the
    timeout delay up to 2.6s which still seems rather acceptable for a
    timeout.

    The consequence of this is that long transfers might be allowed, which
    hence requires the need to interrupt the transfer if wanted by the
    user. We can hence switch to the _interruptible variant of
    wait_for_completion. This leads to a little bit more handling to also
    handle the interrupted case but looks really acceptable overall.

    While at it, we drop the useless, noisy and redundant WARN_ON() call."

This calculation is actually wrong by factor 1000. A 20MHz SPI bus is not
really slow and it will transfer ~2.5MB/s over a single lane.
The calculation would be right for 20kHz but I think this is a more
esoteric case, isn't it?

- ron

________________________________

Ce message, ainsi que tous les fichiers joints à ce message, peuvent contenir des informations sensibles et/ ou confidentielles ne devant pas être divulguées. Si vous n'êtes pas le destinataire de ce message (ou que vous recevez ce message par erreur), nous vous remercions de le notifier immédiatement à son expéditeur, et de détruire ce message. Toute copie, divulgation, modification, utilisation ou diffusion, non autorisée, directe ou indirecte, de tout ou partie de ce message, est strictement interdite.

This e-mail, and any document attached hereby, may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized, direct or indirect, copying, disclosure, distribution or other use of the material or parts thereof is strictly forbidden.