Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

Conor Dooley <conor@xxxxxxxxxx> · Thu, 14 Mar 2024 21:46:40 +0000

Hey,

I'm not really all that familiar with the plic driver itself, so adding
Samuel and Thomas who will (hopefully) understand this better than me.

On Thu, Mar 14, 2024 at 09:12:40AM +0200, Eva Kurchatova wrote:
> If an I2C-HID controller level-triggered IRQ line is routed directly as
> a PLIC IRQ, and we spam input early enough in kernel boot process
> (Somewhere between initializing NET, ALSA subsystems and before
> i2c-hid driver init), then there is a chance of kernel locking up
> completely and not going any further.
> 
> There are no kernel messages printed with all the IRQ, task hang
> debugging enabled - other than (sometimes) it reports sched RT
> throttling after a few seconds. Basic timer interrupt handling is
> intact - fbdev tty cursor is still blinking.
> 
> It appears that in such a case the I2C-HID IRQ line is raised; PLIC
> notifies the (single) boot system hart, kernel claims the IRQ and
> immediately completes it by writing to CLAIM/COMPLETE register.
> No access to the I2C controller (OpenCores) or I2C-HID registers
> is made,

This immediately seemed odd to me, but I have no reason to disbelieve
you, given you say this was discovered in RVVM which is an emulator and
you should know whether or not registers are accessed.
The very first action taken by the ocores i2c controller driver when it
gets an interrupt though is to read a register:

	u8 stat = oc_getreg(i2c, OCI2C_STATUS);

I would expect that this handler would be called, and therefore you'd
see the register read, had the probe function of that driver run to
completion. I'd also expect that the interrupt would not even be
unmasked if that probe function had failed.
In your case though, you can see that the interrupt is not masked,
since it is being raised and handled repeatedly by the PLIC driver.
Has the i2c controller driver probed in the period of boot that you say
this problem manifests?

> so the HID report is never consumed and IRQ line stays
> raised forever. The kernel endlessly claims & completes IRQs
> without doing any work with the device. It doesn't always end up this
> way; sometimes boot process completes and there are no signs of
> interrupt storm or stuck IRQ processing afterwards.
> 
> There was a suspicion this has to do with SiFive PLIC being
> not-so-explicit about level triggered interrupts. The research of this
> issue led this way: There is another DT PLIC binding; a THead one,
> and it has a flag `PLIC_QUIRK_EDGE_INTERRUPT` which allows
> to define IRQ source behavior as 2-cells in DT; and has some other
> changes to the logic (more on that below).
> When attempting to mimic a THead PLIC in kernel DT, and rewriting
> all DT interrupt sources to use 2-cell description, the hang ceases to
> happen. Curious as to what are the kernel side implications of this,
> I went to see what `PLIC_QUIRK_EDGE_INTERRUPT` actually does and
> bit-by-bit disabled the actual differences this flag makes in the
> driver logic.
> 
> This return path in irq-sifive-plic.c@223
> (https://elixir.bootlin.com/linux/latest/source/drivers/irqchip/irq-sifive-plic.c#L223)
> is only enabled for SiFive PLIC, but not for THead one. Removing
> those 2 lines of code from the driver (whilst keeping the DT binding
> properly reporting a SiFive PLIC) fixes the hang. I am not an expert
> on the PLIC driver to debug further or determine what would be a
> proper fix to this, but this probably gets more experienced devs
> somewhere (I hope).

I'm not really familiar with this code either, but just checking what
the effect of your changes is: AFAICT it just sets the handler to be
handle_fasteoi_irq(). I noticed that that function has some code that
will mask the irq if there's no handler registered for that particular
interrupt:
https://elixir.bootlin.com/linux/latest/source/kernel/irq/chip.c#L710

It seems like in your case there might not be one registered (as the
i2c controller's interrupt handler never performs its first access),
so I'm wondering if that masking of the interrupt when no action is
registered is what solves the problem for you.
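
To make that guess concrete, here is a toy userspace model of the two
behaviours. It is only a sketch and not kernel code: every name in it
(toy_irq, toy_flow, toy_count_claims) is made up for illustration, and
it assumes a level-triggered line that stays asserted until a driver
handler services the device.

```c
#include <stdbool.h>

/* Hypothetical model of one level-triggered interrupt source. */
struct toy_irq {
	bool line_asserted;	/* device holds the line high */
	bool masked;		/* source masked at the controller */
	bool has_action;	/* a driver has requested this irq */
};

/* The controller only delivers an asserted, unmasked source. */
static bool toy_pending(const struct toy_irq *irq)
{
	return irq->line_asserted && !irq->masked;
}

/*
 * One claim/complete cycle. With no action registered, either mask
 * the source (the behaviour I think handle_fasteoi_irq() has) or
 * just complete it and leave it unmasked.
 */
static void toy_flow(struct toy_irq *irq, bool mask_when_no_action)
{
	if (!irq->has_action) {
		if (mask_when_no_action)
			irq->masked = true;
		return;
	}
	/* A real handler would service the device, dropping the line. */
	irq->line_asserted = false;
}

/* Count claims until the source goes quiet or 'limit' is reached. */
static int toy_count_claims(bool has_action, bool mask_when_no_action,
			    int limit)
{
	struct toy_irq irq = {
		.line_asserted = true,
		.masked = false,
		.has_action = has_action,
	};
	int claims = 0;

	while (toy_pending(&irq) && claims < limit) {
		claims++;
		toy_flow(&irq, mask_when_no_action);
	}
	return claims;
}
```

In this model, toy_count_claims(false, false, 1000) runs all the way to
the limit, which is the endless claim & complete loop you describe,
while toy_count_claims(false, true, 1000) stops after a single claim
because the source gets masked. Whether the real driver behaves like
the first case on your setup is exactly the part I'm unsure about.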

That's mostly just speculation though, because I am not an expert on the
PLIC driver either.

> This is reproducible at least from Linux 6.4.1 to Linux 6.7.9 on RVVM;

I clearly cannot make any definitive statements, because I'm just
speculating based on this mail: there are no logs and I have not tried
to reproduce this. It does seem, though, like the interrupt is unmasked
before the i2c controller driver has even requested it.
Ordinarily (at least on the hardware I have tested interrupts on) the
interrupts are masked by default and only get unmasked when there's a
user for them in the kernel.

Are interrupts unmasked by default on RVVM?

> Affects any hardware that would have SiFive PLIC + I2C-HID combination;

Have you checked that this affects any actual hardware?

Thanks,
Conor.

> Most likely this is reproducible on QEMU as well if it had i2c-hid emulation,
> or if we passthrough physical I2C-HID device & inject PLIC IRQs from
> it's IRQ line.