On 11/30/20 9:55 AM, Halil Pasic wrote:
> On Mon, 30 Nov 2020 09:30:33 +0100
> Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:
>
>> I'm not really familiar with it, but I think this is closely related
>> to what I asked Bernd Nerz. I fear that if CPUs go away, we might
>> already be in trouble at the firmware/hardware/platform level, because
>> the CPU address is "programmed into the device" so to speak. Thus a
>> directed interrupt from a device may race with anything
>> reordering/removing CPUs, even if CPU addresses of dead CPUs are not
>> reused and the mapping is stable.
>
> From your answer, I read that CPU hot-unplug is supported for LPAR.

I'm not sure about hot-unplug and the firmware telling us about removed
CPUs, but at the very least there is:

echo 0 > /sys/devices/system/cpu/cpu6/online

>>
>> Furthermore, our floating fallback path will try to send a SIGP
>> to the target CPU, which clearly doesn't work when that is permanently
>> gone. Either way, I think these issues are out of scope for this fix,
>> so I will go ahead and merge this.
>
> I agree, it makes no sense to delay this fix.
>
> But if CPU hot-unplug is supported, I believe we should react when a
> CPU that is a target of directed interrupts is unplugged. My guess is
> that in this scenario transient hiccups are unavoidable, and thus
> should be accepted, but we should make sure that we recover.

I agree. I just tested the above command on a firmware test system and
deactivated 4 of 8 CPUs. This is in /proc/interrupts after that:

...
  3:    9392       0       0       0  PCI-MSI mlx5_async@pci:0001:00:00.0
  4:  282741       0       0       0  PCI-MSI mlx5_comp0@pci:0001:00:00.0
  5:       0       2       0       0  PCI-MSI mlx5_comp1@pci:0001:00:00.0
  6:       0       0     104       0  PCI-MSI mlx5_comp2@pci:0001:00:00.0
  7:       0       0       0       2  PCI-MSI mlx5_comp3@pci:0001:00:00.0
  8:       0       0       0       0  PCI-MSI mlx5_comp4@pci:0001:00:00.0
  9:       0       0       0       0  PCI-MSI mlx5_comp5@pci:0001:00:00.0
 10:       0       0       0       0  PCI-MSI mlx5_comp6@pci:0001:00:00.0
 11:       0       0       0       0  PCI-MSI mlx5_comp7@pci:0001:00:00.0
...
So it looks like we are left with registered interrupts for CPUs which
are offline. However, I'm not sure how to trigger a problem with that. I
think the drivers would usually only send a directed interrupt to a CPU
that is currently running the process that triggered the I/O (I tested
this assumption with "taskset -c 2 ping ..."). Now with the CPU offline,
there cannot be such a process, so I think for the most part the queue
would just remain unused. Still, if we do get a directed interrupt for an
offline CPU, it's my understanding that currently we will lose it.

I think this could be fixed with something I tried in prototype code a
while back: in zpci_handle_fallback_irq() I handled the IRQ locally. Back
then it looked like directed IRQs would make it to z15 GA 1.5, and this
was done to help Bernd debug a millicode issue (Jup 905371). I also had a
version of that code, meant as a possible performance improvement, that
would check if the target CPU is available, and only then send the SIGP,
otherwise handling the IRQ locally.

>
> Regards,
> Halil
>