Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

Eva Kurchatova <nyandarknessgirl@xxxxxxxxx> · Mon, 18 Mar 2024 10:48:13 +0200

On Sun, Mar 17, 2024 at 11:27 PM Nam Cao <namcao@xxxxxxxxxxxxx> wrote:
>
> Cc: HID folks
>
> On 14/Mar/2024 Eva Kurchatova wrote:
> > If an I2C-HID controller level-triggered IRQ line is routed directly as
> > a PLIC IRQ, and we spam input early enough in the kernel boot process
> > (somewhere between initializing the NET and ALSA subsystems and before
> > i2c-hid driver init), then there is a chance of the kernel locking up
> > completely and not going any further.
> >
> > There are no kernel messages printed with all the IRQ, task hang
> > debugging enabled - other than (sometimes) it reports sched RT
> > throttling after a few seconds. Basic timer interrupt handling is
> > intact - fbdev tty cursor is still blinking.
> >
> > It appears that in such a case the I2C-HID IRQ line is raised; PLIC
> > notifies the (single) boot system hart, kernel claims the IRQ and
> > immediately completes it by writing to CLAIM/COMPLETE register.
> > No access to the I2C controller (OpenCores) or I2C-HID registers
> > is made, so the HID report is never consumed and IRQ line stays
> > raised forever. The kernel endlessly claims & completes IRQs
> > without doing any work with the device. It doesn't always end up this
> > way; sometimes boot process completes and there are no signs of
> > interrupt storm or stuck IRQ processing afterwards.
>
> It seems I2C HID's interrupt handler (i2c_hid_irq) returns immediately if
> I2C_HID_READ_PENDING is set. This flag is supposed to be cleared in
> i2c_hid_xfer(), but since the (threaded) interrupt handler runs at higher
> priority, the flag is never cleared. So we have a lock-up: interrupt
> handler won't do anything unless the flag is cleared, but the clearing of
> this flag is done in a lower priority task which never gets scheduled while
> the interrupt handler is active.
>
> There is RT throttling to prevent RT tasks from locking up the system like
> this. I don't know much about scheduling, so I am not really sure
> why RT throttling does not work. I think because RT throttling triggers
> when RT tasks take too much CPU time, but in this case hard interrupt
> handlers take lots of CPU time too (~50% according to my measurement), so
> RT throttling doesn't trigger often enough (in this case, it triggers once
> and never again). Again, I don't know much about scheduler so I may be
> talking nonsense here.
>
> The flag I2C_HID_READ_PENDING seems to be used to make sure that only 1
> I2C operation can happen at a time. But this seems pointless, because I2C
> subsystem already takes care of this. So I think we can just remove it.
>
> Can you give the below patch a try?
>
> diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
> index 2735cd585af0..799ad0ef9c4a 100644
> --- a/drivers/hid/i2c-hid/i2c-hid-core.c
> +++ b/drivers/hid/i2c-hid/i2c-hid-core.c
> @@ -64,7 +64,6 @@
>  /* flags */
>  #define I2C_HID_STARTED                0
>  #define I2C_HID_RESET_PENDING  1
> -#define I2C_HID_READ_PENDING   2
>
>  #define I2C_HID_PWR_ON         0x00
>  #define I2C_HID_PWR_SLEEP      0x01
> @@ -190,15 +189,10 @@ static int i2c_hid_xfer(struct i2c_hid *ihid,
>                 msgs[n].len = recv_len;
>                 msgs[n].buf = recv_buf;
>                 n++;
> -
> -               set_bit(I2C_HID_READ_PENDING, &ihid->flags);
>         }
>
>         ret = i2c_transfer(client->adapter, msgs, n);
>
> -       if (recv_len)
> -               clear_bit(I2C_HID_READ_PENDING, &ihid->flags);
> -
>         if (ret != n)
>                 return ret < 0 ? ret : -EIO;
>
> @@ -566,9 +560,6 @@ static irqreturn_t i2c_hid_irq(int irq, void *dev_id)
>  {
>         struct i2c_hid *ihid = dev_id;
>
> -       if (test_bit(I2C_HID_READ_PENDING, &ihid->flags))
> -               return IRQ_HANDLED;
> -
>         i2c_hid_get_input(ihid);
>
>         return IRQ_HANDLED;

The patch applies cleanly on top of 6.7.9 and builds OK (no warnings, etc.).

This indeed fixes the hang completely.
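
For anyone reading this in the archive, here is a minimal user-space sketch
of the livelock as I understand it from your analysis above. It is a toy
model, not kernel code: `struct model`, `handle_irq()` and `run()` are
hypothetical names, and it assumes the lower-priority task that would clear
the flag never gets to run while the level-triggered line stays high.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct model {
	bool irq_line;        /* level-triggered: high until the report is read */
	bool read_pending;    /* stands in for I2C_HID_READ_PENDING */
	unsigned long claims; /* number of PLIC claim/complete cycles */
};

/*
 * Models one pass through the interrupt handler.
 * With check_flag set (unpatched behaviour), the handler returns early
 * while the flag is set and never touches the device, so the line stays
 * high. Without it (patched behaviour), the report is read and the line
 * drops. Returns true if the report was consumed.
 */
static bool handle_irq(struct model *m, bool check_flag)
{
	m->claims++; /* claim, then immediately complete */

	if (check_flag && m->read_pending)
		return false; /* early return: device untouched */

	m->irq_line = false; /* reading the input report deasserts the line */
	return true;
}

/*
 * Assumption: the flag was set by a lower-priority task that is starved
 * for as long as the line is high, so nothing in this loop clears it.
 * Returns the number of claims made before the line dropped, or
 * max_iters if it never did.
 */
static unsigned long run(bool check_flag, unsigned long max_iters)
{
	struct model m = { .irq_line = true, .read_pending = true, .claims = 0 };

	assert(max_iters > 0);

	while (m.irq_line && m.claims < max_iters)
		handle_irq(&m, check_flag);

	return m.claims;
}
```

Calling `run(true, n)` exhausts the iteration budget for any `n`, which
corresponds to the endless claim & complete loop, while `run(false, n)`
finishes after a single claim.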

I modified RVVM to send millions of keystroke events per second,
and put `reboot` as a service hook in the guest. It has been continuously
rebooting without a hitch for the last 30 minutes or so (a full boot takes
around 2 seconds), whereas unpatched 6.7.9 hangs almost immediately
under such conditions (I reverted your patch and rebuilt to be sure).

Thank you very much for this! I hope to see it upstreamed soon.