Hi, On Mon Feb 5, 2024 at 10:03 AM CET, Miquel Raynal wrote: > Hello Théo, > > theo.lebrun@xxxxxxxxxxx wrote on Fri, 02 Feb 2024 18:29:40 +0100: > > > The ->runtime_suspend() and ->runtime_resume() callbacks are not > > expected to call spi_controller_suspend() and spi_controller_resume(). > > Remove calls to those in the cadence-qspi driver. > > > > Those helpers have two roles currently: > > - They stop/start the queue, including dealing with the kworker. > > - They toggle the SPI controller SPI_CONTROLLER_SUSPENDED flag. It > > requires acquiring ctlr->bus_lock_mutex. > > > > The cadence-qspi ->exec_op() implementation bumps the usage counter at > > its start. It might therefore run our ->runtime_resume() > > implementation. However, ctlr->bus_lock_mutex is acquired by > > spi_mem_exec_op() while ->exec_op() is being called. > > > > Here is a brief call tree highlighting the issue: > > > > spi_mem_exec_op() > > ... > > spi_mem_access_start() > > mutex_lock(&ctlr->bus_lock_mutex) > > > > cqspi_exec_mem_op() > > pm_runtime_resume_and_get() > > cqspi_resume() > > spi_controller_resume() > > mutex_lock(&ctlr->bus_lock_mutex) > > ... > > > > spi_mem_access_end() > > mutex_unlock(&ctlr->bus_lock_mutex) > > ... > > > > The fatal conclusion of this is a deadlock: we acquire a lock on each > > operation but while running the operation, we might want to runtime > > resume and acquire the same lock. > > > > Anyway, those helpers (spi_controller_{suspend,resume}) are aimed at > > system-wide suspend and resume and should NOT be called at runtime > > suspend & resume. > > > > Side note: the previous implementation had a second issue. It acquired a > > pointer to both `struct cqspi_st` and `struct spi_controller` using > > dev_get_drvdata(). Neither embed the other. This lead to memory > > corruption that was being hidden inside the big cqspi->f_pdata array on > > my setup. It was working until I tried changing the array side to its > > theorical max of 4, which lead to the discovery of this gnarly bug. > > > > Fixes: 0578a6dbfe75 ("spi: spi-cadence-quadspi: add runtime pm support") > > Fixes: 2087e85bb66e ("spi: cadence-quadspi: fix suspend-resume implementations") > > Your commit log makes total sense but I believe the diff is gonna break > again the suspend to RAM operation. This is only my understanding > right after quickly going through the whole story, so maybe I'm > totally off topic. The current ->runtime_suspend() implementation would indeed (probably) work for suspend-to-RAM if it wasn't for the wrong pointers to cqspi and spi_controller (see side note from commit message). I've not found a moment where `struct cqspi_st` embed `struct spi_controller` at its start, so I do not believe this has ever worked. It might be the result of a mistake while porting a patch from a branch that included other changes. > What happened if I understand the two commits blamed above: > > - There were PM hooks. > - Someone turned them into runtime PM hooks (breaking regular > suspend/resume). > - Someone else added the "missing" suspend/resume logic inside the > runtime PM hooks to fix suspend and resume. > - You are removing this logic because it leads to deadlocks. > > There was likely a misconception of what is expected in both cases > (quick and small power savings vs. full power cycle/loosing the whole > configuration). > > I would propose instead to create two distinct set of functions: > - One for runtime PM > - One for suspend/resume > This way the runtime PM no longer deadlocks and people using > suspend/resume won't get affected? I don't know if your runtime hooks > *will* always be called during a suspend/resume. I hope so, which would > make the split quite easy and without any code duplication. That does indeed sound like the right approach. Runtime hooks can be called from suspend/resume if needs be. Runtime PM then gets disabled at the late stage. I do not believe currently system-wide suspend can be working. spi_controller_{suspend,resume} are being called with a bogus pointer. This makes me ask: should the system-wide suspend/resume part be addressed with this patch or a follow-up? It feels like a separate concern to me. The nice thing is that I have easy access to J7200, which uses the same controller and supports suspend-to-RAM. That should make it a good test setup. Thanks, -- Théo Lebrun, Bootlin Embedded Linux and Kernel engineering https://bootlin.com