Hey,
I'm trying to debug an issue on an embedded Linux system (v4.4.107) where
repeatedly and quickly suspending (mem suspend) and resuming the system
causes one of the I2C controllers in the system to malfunction. I've run
out of ideas and would like to know if anyone recognizes my issue or can
provide clues to move forward with the debugging.
Background: My system can be suspended and resumed using two buttons.
The buttons are attached to a GPIO expander, which in turn is connected
to the SoC via an I2C bus. The wake-up button acts as a wake-up source
for the kernel. When a button is pressed, the GPIO expander triggers an
interrupt and the SoC accesses the I2C bus to read out which button
was pressed. If I mash the buttons like a 2-year-old would, it'll
eventually (within a minute or so) fail to suspend the system with an
error from the kernel: "PM: noirq suspend of devices failed". Just before
this happens, I also see "controller timed out" errors coming from the
I2C controller driver in the kernel log. The device that fails to
suspend is the GPIO expander and, if I understand the kernel code
correctly, this is because an IRQ arrived while suspend was in
progress. The kernel tries to process the IRQ before going to sleep,
but the I2C controller is no longer working, so the IRQ cannot be
served, suspend is aborted and the system is resumed. In
a way this is correct behaviour: the kernel is going to sleep,
receives an IRQ from the wake-up source and aborts the suspend.
BUT it does not explain why the controller gets timeouts and why it
only happens sometimes. If I suspend and resume more gently (i.e. no
spamming of buttons), it works great.
What is odd is that once the system is resumed again, the I2C controller
starts working again. But if I keep repeating the same procedure, the
system is no longer able to suspend -- the suspend failure happens every
time and the system cannot go to sleep, which is a disaster because this
is a battery-powered device. What's even worse is that sometimes the
GPIO expander stops working altogether, likely because it uses an
IRQF_ONESHOT IRQ and, when we are unable to process the IRQs (due to the
broken I2C controller), the IRQ is never re-enabled. I've been
able to verify this by successfully sending I2C messages from the CLI to
the ADP5589 to poll its status, while IRQs from it are not arriving at
the IRQ handler.
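The same kind of status poll can also be done from a small userspace program through the i2c-dev interface, which may be handy on a target without i2c-tools. The sketch below is only an illustration and not what was actually run above: build_reg_read() and read_reg() are hypothetical helper names, and the device node, slave address and register number passed in are assumptions that must be checked against the board and the ADP5589 datasheet.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/i2c.h>
#include <linux/i2c-dev.h>

/*
 * Fill a two-message transfer: write the register number, then read
 * one byte back after a repeated start. (Hypothetical helper.)
 */
static void build_reg_read(struct i2c_msg msgs[2], uint16_t addr,
                           uint8_t *reg, uint8_t *val)
{
    msgs[0].addr  = addr;
    msgs[0].flags = 0;          /* write */
    msgs[0].len   = 1;
    msgs[0].buf   = reg;

    msgs[1].addr  = addr;
    msgs[1].flags = I2C_M_RD;   /* read */
    msgs[1].len   = 1;
    msgs[1].buf   = val;
}

/*
 * Read one register from a slave behind the given i2c-dev node.
 * Returns 0 on success, -1 if the node cannot be opened or the
 * transfer fails. (Hypothetical helper.)
 */
static int read_reg(const char *dev, uint16_t addr, uint8_t reg,
                    uint8_t *val)
{
    struct i2c_msg msgs[2];
    struct i2c_rdwr_ioctl_data xfer = { .msgs = msgs, .nmsgs = 2 };
    int fd, ret;

    fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;

    build_reg_read(msgs, addr, &reg, val);
    ret = ioctl(fd, I2C_RDWR, &xfer);
    close(fd);

    return ret < 0 ? -1 : 0;
}
```

For example, read_reg("/dev/i2c-1", 0x34, 0x01, &val) would fetch one status byte, assuming (unverified here) that the expander sits on bus 1 at address 0x34 and that register 0x01 is its interrupt status register. If such a read succeeds while the keypad IRQ handler stays silent, that supports the theory that the I2C path is fine and the interrupt line is left disabled.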
For reference, the I2C controller I'm using is Designware I2C. The
driver is drivers/i2c/busses/i2c-designware-*. The GPIO expander is an
ADP5589 and the driver I'm using is
drivers/input/keyboard/adp5589-keys.c. When the issue occurs, the
controller timeout
(https://elixir.bootlin.com/linux/v4.4.107/source/drivers/i2c/busses/i2c-designware-core.c#L659)
happens because an ongoing I2C transmit (as requested by the ADP5589 irq
handler) does not finish within 1 second.
I have connected a logic analyzer to the I2C pins and when the
controller timeout happens, I see that both SDA and SCL are pulled low.
They are kept low until the system is resumed and the controller
recovers. At first I thought this issue was an I2C bus fault, so I tried
implementing I2C bus recovery by remuxing the SDA and SCL pins to the
GPIO controller and then pulsing SCL. However, as soon as I remux
the pins, SCL and SDA are no longer pulled low. To me this
indicates that it is not one of the slaves that is holding the bus low;
it is the master. I can also tell from the controller status registers that
when the controller timeout occurs, the controller is not in an idle
state but it is also not getting the STOP bit interrupt nor anything
that would "complete" the transfer. It's stuck. I have looked upstream
in more recent kernels than 4.4 for fixes that would resolve this (and
there are quite a few commits that mention "controller timed out" for
the designware driver), but so far nothing has worked.
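For readers unfamiliar with the recovery technique mentioned above, here is a minimal stand-alone sketch of the SCL-pulsing idea. It is a userspace simulation under stated assumptions, not the code from the actual system: struct sim_bus, sim_get_sda(), sim_set_scl() and recover_bus() are hypothetical names, and the "slave" is modelled simply as a device that releases SDA after it has been clocked a given number of times.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bus once SDA/SCL are muxed to GPIO. */
struct sim_bus {
    bool scl;             /* level currently driven on SCL */
    int  slave_bits_left; /* bits a stuck slave still wants to shift out */
};

/* SDA as seen on the wire: a slave stuck mid-byte holds it low. */
static bool sim_get_sda(const struct sim_bus *bus)
{
    return bus->slave_bits_left == 0;
}

/* Drive SCL; each falling edge lets the simulated slave shift one bit. */
static void sim_set_scl(struct sim_bus *bus, bool level)
{
    if (bus->scl && !level && bus->slave_bits_left > 0)
        bus->slave_bits_left--;
    bus->scl = level;
}

/*
 * Pulse SCL up to nine times, stopping as soon as SDA is released.
 * Returns the number of pulses that were needed, or -1 if SDA is
 * still low after nine pulses.
 */
static int recover_bus(struct sim_bus *bus)
{
    int pulses = 0;

    sim_set_scl(bus, true);
    while (!sim_get_sda(bus) && pulses < 9) {
        sim_set_scl(bus, false);
        sim_set_scl(bus, true);
        pulses++;
    }

    return sim_get_sda(bus) ? pulses : -1;
}
```

In this model a slave stuck three bits into a byte is freed after three pulses, while a bus whose lines are already high needs zero. The behaviour described above, where both lines are released the moment the pins are remuxed away from the controller, corresponds to the zero-pulse case, which is why the master rather than a slave looks like the culprit.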
Even if I reset the whole controller (from the SoC syscontrol), it
does not work again until the system is fully resumed. Queuing new
transactions before the system is suspended only makes the controller
time out again.
This makes me wonder: what other part of the system gets suspended that
makes the I2C controller malfunction? And why does it not always happen?
Isn't the suspend sequence executed the same way every time (e.g. the
order in which devices are suspended)?
Questions:
- If I call enable_irq_wake() on an IRQ, the IRQ should remain ON even
if the system is suspended. Will the kernel ensure all parent devices
are awake before it invokes the device interrupt handler to serve a
wake-up IRQ? If I put printk's in the kernel suspend code, it seems to me
that the ISR is called when more or less everything else is suspended /
turned off.
- I've tried to modify the I2C controller driver so that it never goes
to sleep, just as an experiment. I just set the PM ops to NULL and
changed the request_irq flags to IRQF_NO_SUSPEND; is this sufficient to
prevent the device from going to sleep?
If anyone has ideas on how to debug this issue, I'd greatly appreciate it.
Best regards, Magnus.
_______________________________________________
Kernelnewbies mailing list
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies