Re: [PATCH stable 5.15,5.10 0/4] Fix EBS volume attach on AWS ARM instances

Marc Zyngier <maz@xxxxxxxxxx> · Mon, 28 Nov 2022 17:53:55 +0000

On Mon, 28 Nov 2022 17:08:31 +0000,
Luiz Capitulino <luizcap@xxxxxxxxxx> wrote:
> 
> Hi,
> 
> [ Marc, can you help reviewing? Esp. the first patch? ]
> 
> This series of backports from upstream to stable 5.15 and 5.10 fixes an issue
> we're seeing on AWS ARM instances where attaching an EBS volume (which is an
> NVMe device) to the instance after offlining CPUs causes the device to take
> several minutes to show up and eventually nvme kworkers and other threads start
> getting stuck.
> 
> This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it
> on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.

That's because x86 has a very different allocation policy compared to
what the ITS does. The x86 vector space is tiny, so vectors are only
allocated when required. In your case, that's when the CPUs are
onlined.

With the ITS, all the vectors are allocated upfront, as this is
essentially free. But in the case of managed interrupts, these vectors
are now pointing to offline CPUs. The ITS tries to fix that, but
doesn't have nearly enough information. And the correct course of
action is to keep these interrupts in the shutdown state, which is
what the series is doing.
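
As a rough illustration of that last point, here is a toy model in plain
C. It is not the irqchip or genirq code: every toy_* name is made up for
this sketch, and the 32-bit masks simply stand in for cpumasks. It only
shows the idea that a managed interrupt whose affinity mask contains no
online CPU stays shut down, and gets started once a CPU from its mask
comes online.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_irq_state { TOY_IRQ_SHUTDOWN, TOY_IRQ_STARTED };

struct toy_irq {
	uint32_t affinity;		/* CPUs this managed interrupt may target */
	enum toy_irq_state state;
	int target_cpu;			/* -1 while the interrupt is shut down */
};

/* Return the lowest CPU number set in mask, or -1 if mask is empty. */
static int toy_first_cpu(uint32_t mask)
{
	for (int cpu = 0; cpu < 32; cpu++)
		if (mask & (UINT32_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * Activate a managed interrupt. If none of the CPUs in its affinity
 * mask is online, leave it shut down instead of guessing a target.
 */
static void toy_irq_activate(struct toy_irq *irq, uint32_t online)
{
	uint32_t usable = irq->affinity & online;

	if (!usable) {
		irq->state = TOY_IRQ_SHUTDOWN;
		irq->target_cpu = -1;
		return;
	}

	irq->target_cpu = toy_first_cpu(usable);
	irq->state = TOY_IRQ_STARTED;
}

/* Bring a CPU online and retry activation if the interrupt was shut down. */
static void toy_cpu_online(struct toy_irq *irq, uint32_t *online, int cpu)
{
	assert(cpu >= 0 && cpu < 32);

	*online |= UINT32_C(1) << cpu;
	if (irq->state == TOY_IRQ_SHUTDOWN)
		toy_irq_activate(irq, *online);
}
```

With only CPU0 online and an interrupt whose mask covers CPUs 4-5,
toy_irq_activate() leaves it shut down; a later toy_cpu_online() for
CPU5 starts it and targets CPU5.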

>
> An easy reproducer is:
> 
> 1. Start an ARM instance with 32 CPUs

To satisfy my own curiosity, is that in a guest or bare metal? It
shouldn't make any difference, but hey...

Anyway, patch #1 looks OK to me, but I haven't tried to dig further
into something that is "oh so last year" ;-). Especially as we're
rewriting the whole of the MSI stack! FWIW:

Acked-by: Marc Zyngier <maz@xxxxxxxxxx>

	M.

-- 
Without deviation from the norm, progress is not possible.