Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal

Miquel Raynal <miquel.raynal@xxxxxxxxxxx> · Tue, 5 Dec 2023 08:49:54 +0100

Hi Ronald,

> >>> I fear this is a jffs2 (summary feature) bug. Chances are great that yo  
> u're able
> >>> to trigger the very same using a sudden loss of power.
> >>>  
> >>>> It's not just spi-atmel, any spi-mem controller might be tempted to us  
> e
> >>>> interruptible^Wkillable transfers just because the timeout values can
> >>>> be really big as the memory sizes increase.
> >>>>
> >>>> One solution is to change the completion helpers back to something
> >>>> non-killable/non-interruptible, but the user experience will be
> >>>> slightly degraded. The other would be to look into jffs2 (if it's the
> >>>> only filesystem playing with signals during unmount, tbh I don't know)  
> .
> >>>> But maybe this signaling mechanism can't be hacked for compatibility
> >>>> reasons. Handling this at the spi level would be a mix of layers, I'm
> >>>> not ready for that.
> >>>>
> >>>> Richard, Mark, what's your opinion here?  
> >>>
> >>> I *think* we can remove the signal handling code from jffs2 since it ma  
> kes
> >>> already use of the kthread_should_stop() API.
> >>> That way we can keep the SPI transfer interruptible by signals.
> >>> ...reading right now into the history to figure better.  
> >>
> >> After a brief discussion with dwmw2 another question came up, if an spi transfer
> >> is cancelled, *all* other IO do the filesystem has to stop too.
> >> IO can happen concurrently, if only one IO path dies but the other ones can
> >> make progress, the filesystem becomes inconsistent and all hope is lost.
> >>
> >> Miquel, is this guaranteed by your changes?  
> >
> > Absolutely not, the changes are in a spi controller, there is nothing
> > specific to the user there. If a filesystem transfer get interrupted,
> > it's the filesystem responsibility to cancel the other IOs if that's
> > relevant for its own consistency?  
> 
> I think yes. But the only thing the FS can do is stop any writes from now
> on which is not a useful consequence of killing a process.
> 
> Anyway, the whole issue started with commit e0205d6203c2 "spi: atmel:
> Prevent
> false timeouts on long transfers". Citing the commit message here:
>      "spi: atmel: Prevent false timeouts on long transfers
> 
>      A slow SPI bus clocks at ~20MHz, which means it would transfer about
>      2500 bytes per second with a single data line. Big transfers, like whe
> n
>      dealing with flashes can easily reach a few MiB. The current DMA
> timeout
>      is set to 1 second, which means any working transfer of about 4MiB wil
> l
>      always be cancelled.
> 
>      With the above derivations, on a slow bus, we can assume every byte
> will
>      take at most 0.4ms. Said otherwise, we could add 4ms to the 1-second
>      timeout delay every 10kiB. On a 4MiB transfer, it would bring the
>      timeout delay up to 2.6s which still seems rather acceptable for a
>      timeout.
> 
>      The consequence of this is that long transfers might be allowed, which
>      hence requires the need to interrupt the transfer if wanted by the
>      user. We can hence switch to the _interruptible variant of
>      wait_for_completion. This leads to a little bit more handling to also
>      handle the interrupted case but looks really acceptable overall.
> 
>      While at it, we drop the useless, noisy and redundant WARN_ON() call."
> 
> This calculation is actually wrong by factor 1000. A 20MHz SPI bus is not
> really slow and it will transfer ~2.5MB/s over a single lane.
> The calculation would be right for 20kHz but I think this is a more
> esoteric case, isn't it?

Please, I would appreciate if you would adopt a more constructive
approach. Yes I am the ugly villain who broke your setup and I am sorry
about that. But let's face the reality:

- Filesystems being sensible to user signals is probably not an ideal
  choice but this was made a decade ago, so there is no blame here,
  but IMO it was not straightforward to think about this case.
  Anyway, as rightfully pointed out by David and Richard, there is the
  coherency problem of the filesystem, with which we don't want to
  play.

- The introduction you point above is indeed wrong by a factor
  1000, as the throughput should be 2.5MB/s and not 2.5kB/s, of
  course. But then if we transfer 2.5MB/s don't you feel like a
  transfer of 4MB is actually going to be interrupted for no reason?
  Sorry for messing the commit log, but the problem is real and the
  diff is relevant.

So instead of nervously insisting for a global revert, I believe the
right approach will be to accept 'big' timeout delays for now on spi
transfers and the final fix is probably to remove the
'interruptible/killable' keyword from our waits. And as (the other)
David said, maybe at some point we will decide to split spi transfers
if they are too big, even though heuristics in this case are not
straightforward.

Thanks,
Miquèl