From: Michael Kelley <mhklinux@xxxxxxxxxxx> This patch set makes significant changes in how Linux guests running on Hyper-V handle interrupts from VMBus devices. It's "RFC" because of the level of change, and because I'm interested in macro-level feedback on the whole idea. The usual detailed-level code review feedback is also welcome. Background ---------- Linux guests running on Hyper-V are presented with a range of synthetic devices, including storage, networking, mouse, keyboard, etc. Guests are also presented with management-related pseudo-devices, including time synchronization, heartbeat, and host-initiated shutdown. These devices use the VMBus connection between the host and the guest, and each device has one or more VMBus channels for passing messages between the host and guest. Separately, a control path exists for creating and tearing down VMBus channels when needed. Hyper-V VMBus and its synthetic devices play a role similar to virtio and virtio devices. The host interrupts the guest when it sends a new message to the guest in an otherwise empty channel, and when it sends a new control message to the guest. All interrupts are multiplexed onto a single per-CPU architectural interrupt as defined by the processor architecture. On arm64, this architectural interrupt is a PPI, and is represented in Linux as a Linux per-CPU IRQ. The x86/x64 architecture does not provide per-CPU interrupts, so Linux reserves a fixed x86 interrupt vector (HYPERVISOR_CALLBACK_VECTOR) on all CPUs, with a hard-coded interrupt handler. No Linux IRQ is assigned on x86/x64. The interrupt handler for the architectural interrupt is the VMBus interrupt handler, and it runs whenever the host generates an interrupt for a channel or a control message. The VMBus interrupt handler consults data structures provided by the per-CPU Hyper-V synthetic interrupt controller (synic) to determine which channel or channels have a pending interrupt, or if there is a pending control message. When doing this demultiplexing, the VMBus interrupt handler directly invokes the per-device handlers, and processes any control messages. In this scheme, the individual VMBus devices and their channels are not modelled as Linux IRQs. /proc/interrupts shows counts for the top-level architectural interrupt, but not for the individual VMBus devices. The VMBus driver has implemented separate reporting of per-device data under /sys. Similarly, /proc/irq/<nn> does not show affinity information for individual VMBus devices, and the affinity cannot be changed via /proc/irq. Again, a separate interface under /sys is provided for changing the vCPU that a VMBus channel should interrupt. What's New ---------- Linux kernel community members have commented that it would be better for the Hyper-V synic and the handling of VMBus channel interrupts to be modelled as Linux IRQs. Then they would participate in existing IRQ handling mechanisms, and not need something separate under /sys. This patch set provides such an implementation. With this patch set, the output of /proc/interrupts looks like this in a Linux guest on a Hyper-V x86/x64 Generation 2 VM with 8 vCPUs (the columns for CPUs 2 thru 5 are omitted due to text width considerations): CPU0 CPU1 CPU6 CPU7 8: 0 0 0 0 IO-APIC 8-edge rtc0 9: 2 0 0 0 IO-APIC 9-fasteoi acpi 24: 16841 0 0 0 VMBus 7 heartbeat 25: 1 0 0 0 VMBus 9 shutdown 26: 6752 0 0 0 VMBus 10 timesync 27: 1 0 0 0 VMBus 11 vss 28: 22095 0 0 0 VMBus 14 pri@storvsc1 29: 0 85 0 0 VMBus 15 pri@storvsc0 30: 3 0 0 0 VMBus 1 balloon 31: 106 0 0 0 VMBus 3 mouse 32: 73 0 0 0 VMBus 4 keyboard 33: 6 0 0 0 VMBus 5 framebuffer 34: 0 0 0 0 VMBus 13 netvsc 35: 1 0 0 0 VMBus 8 kvp 36: 0 0 0 0 VMBus 16 netvsc 37: 0 0 0 0 VMBus 17 netvsc 38: 0 0 0 0 VMBus 18 netvsc 39: 0 0 2375 0 VMBus 19 netvsc 40: 0 0 0 1178 VMBus 20 netvsc 41: 1003 0 0 0 VMBus 21 netvsc 42: 0 4337 0 0 VMBus 22 netvsc 43: 0 0 0 0 VMBus 23 sub@storvsc1 44: 0 0 0 0 VMBus 24 sub@storvsc0 NMI: 0 0 0 0 Non-maskable interrupts LOC: 0 0 0 0 Local timer interrupts SPU: 0 0 0 0 Spurious interrupts PMI: 0 0 0 0 Performance monitoring interrupts IWI: 1 0 0 2 IRQ work interrupts RTR: 0 0 0 0 APIC ICR read retries RES: 411 233 349 318 Rescheduling interrupts CAL: 135236 29211 99526 36424 Function call interrupts TLB: 0 0 0 0 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts THR: 0 0 0 0 Threshold APIC interrupts DFR: 0 0 0 0 Deferred Error APIC interrupts MCE: 0 0 0 0 Machine check exceptions MCP: 109 109 109 109 Machine check polls HYP: 71 0 0 0 Hypervisor callback interrupts HRE: 0 0 0 0 Hyper-V reenlightenment interrupts HVS: 183391 109695 181539 339239 Hyper-V stimer0 interrupts ERR: 0 MIS: 0 PIN: 0 0 0 0 Posted-interrupt notification event NPI: 0 0 0 0 Nested posted-interrupt event PIW: 0 0 0 0 Posted-interrupt wakeup event IRQs 24 thru 44 are the VMBus channel IRQs that didn't previously exist. Some devices, such as storvsc and netvsc, have multiple channels, each of which has a separate IRQ. The HYP line now shows counts only for control messages instead of for all VMBus interrupts. The HVS line continues to show Hyper-V synthetic timer (stimer0) interrupts. (In older versions of Hyper-V where stimer0 interrupts are delivered as control messages, those interrupts are accounted for on the HYP line. Since this is legacy behavior, it seemed unnecessary to create a separate Linux IRQ to model these messages.) The basic approach is to update the VMBus channel create/open/close/delete infrastructure, and the VMBus interrupt handler, to create and process the additional IRQs. The goal is that individual VMBus drivers require no code changes to work with the new IRQs. However, a driver may optionally annotate the IRQ using knowledge that only the driver has. For example, in the above /proc/interrupts output, the storvsc driver for synthetic disks has been updated to distinguish the primary channel from the subchannels, and to distinguish controller 0 from controller 1. Since storvsc models a SCSI controller, the numeric designations are set to match the "host0" and "host1" SCSI controllers that could be seen with "lsscsi -H -v". Similarly annotating the "netvsc" entries is a "to do" item. In furtherance of the "no changes to existing VMBus drivers" goal, all VMBus IRQs use the same interrupt handler. Each channel has a callback function associated with it, and the interrupt handler invokes this callback in the same way as before VMBus IRQs were introduced. VMBus code creates a standalone linear IRQ domain for the VMBus IRQs. The hwirq #'s are the VMBus channel relid's, which are integers in the range 1 to 2047. Each relid uniquely identifies a VMBus channel, and a channel interrupt from the host provides the relid. A mapping from IRQ to hwirq (relid) is created when a new channel is created, and the mapping is deleted when the channel is deleted, so adding and removing VMBus devices in a running VM is fully functional. Getting the IRQ counting correct requires a new IRQ flow handler in /kernel/irq/chip.c. See Patch 6. The existing flow handlers don't handle this case, and creating a custom flow handler outside of chip.c isn't feasible because the needed components aren't available to code outside of /kernel/irq. When a guest CPU is taken offline, Linux automatically changes the affinity of IRQs assigned to that CPU so that the IRQ is handled on another CPU that is still online. Unfortunately, VMBus channel IRQs are not able to take advantage of that mechanism. At the point the mechanism runs, it isn't possible to know when Hyper-V started interrupting the new CPU; hence pending interrupts may be stranded on the old offline CPU. Existing VMBus code in Linux prevents a CPU from going offline when a VMBus channel interrupt is affined to it, and that code must be retained. Of course, VMBus channel IRQs can manually be affined to a different CPU, which could then allow a CPU to be taken offline. I have an idea for a somewhat convoluted way to allow the automatic Linux mechanism to work, but that will be another patch set. Having VMBus channel interrupts appear in /proc/irq means that user-space irqbalance can change the affinity of the interrupts, where previously it could not. For example, this gives irqbalance the ability to change (and perhaps degrade) the default affinity configuration that was designed to maximize network throughput. This is yet another reason to disable irqbalance in VM images. The custom VMBus entries under /sys are retained for backwards compatibility, but they are integrated. For example, changing the affinity via /proc/irq is reflected in /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu, and vice versa. Patches ------- The patches have been tested on x86/x64 Generation 1 and Generation 2 VMs, and on arm64 VMs. * Patches 1 and 2 fixes some error handling that is needed by subsequent patches. They could be applied independent of this patch set. * Patches 3,4, and 5 add support for IRQ names, and add annotations in the storvsc and the Hyper-V vPCI driver so that descriptions in /proc/interrupts convey useful information. * Patch 6 adds a new IRQ flow handler needed to properly account for IRQs as VMBus IRQs or as the top-level architectural interrupt. * Patch 7 creates the new IRQ domain for VMBus channel IRQs. * Patch 8 allocates the VMBus channel IRQs, and creates the mapping between VMBus relid (used as the hwirq) and the Linux IRQ number. That mapping is then used for lookups. * Patch 9 does request_irq() for the new VMBus channel IRQs, and converts the top-level VMBus architectural interrupt handler to use the new VMBus channel IRQs for handling channel interrupts. With this patch, /proc/interrupts shows counts for the VMBus channel IRQs. * Patch 10 implements the irq_set_affinity() function for VMBus channel IRQs, and integrates it with the existing Hyper-V-specific sysfs mechanism for changing the CPU that a VMBus channel will interrupt. * Patch 11 updates hv_synic_cleanup() during CPU offlining to handle the lack of waiting for the MODIFYCHANNEL message in vmbus_irq_set_affinity() in Patch 10. * Patch 12 fixes a race in VMBus channel interrupt assignments that existed prior to this patch set when taking a CPU offline. Open Topics ----------- * The performance impact of the additional IRQ processing in the interrupt path appears to be minimal, based on looking at the code. Measurements are still to be taken to confirm this. * Does the netvsc driver have sufficient information to annotate what appears in /proc/interrupts, particularly when multiple synthetic NICs are present? At first glance, it appears that the identifying info isn't generated until after vmbus_open() is called, and by then it's too late to do the IRQ annotation. Need some input from a netvsc expert. * The VMBus irq domain and irq chip data structures are placed in the vmbus_connection structure. Starting with VMBus protocol version 5.0, the VMBus host side supports multiple connections from a single guest instance (see commit ae20b254306a6). While there's no upstream kernel code that creates multiple VMBus connections, it's unclear whether the IRQ domain should be global across all VMBus connection, or per connection. Are relid's global (to the VM) or per connection? * I have not tried or tested any hv_sock channels. Is there a handy test program that would be convenient for opening multiple hv_sock connections to the guest, have them persist long enough to see what /proc/interrupts shows, and try changing the IRQ affinity? * I have not tried or tested Linux guest hibernation paths. The patches build and run against 6.10-rc2 and next-20240603. Michael Kelley (12): Drivers: hv: vmbus: Drop unsupported VMBus devices earlier Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs group Drivers: hv: vmbus: Add an IRQ name to VMBus channels PCI: hv: Annotate the VMBus channel IRQ name scsi: storvsc: Annotate the VMBus channel IRQ name genirq: Add per-cpu flow handler with conditional IRQ stats Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus connection Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid mapping Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts Drivers: hv: vmbus: Implement vmbus_irq_set_affinity Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining CPUs Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going offline arch/x86/kernel/cpu/mshyperv.c | 9 +- drivers/hv/channel.c | 125 ++++------ drivers/hv/channel_mgmt.c | 139 ++++++++---- drivers/hv/connection.c | 48 ++-- drivers/hv/hv.c | 90 +++++++- drivers/hv/hv_common.c | 2 +- drivers/hv/hyperv_vmbus.h | 22 +- drivers/hv/vmbus_drv.c | 338 +++++++++++++++++++--------- drivers/pci/controller/pci-hyperv.c | 5 + drivers/scsi/storvsc_drv.c | 9 + include/asm-generic/mshyperv.h | 9 +- include/linux/hyperv.h | 9 + include/linux/irq.h | 1 + kernel/irq/chip.c | 29 +++ 14 files changed, 567 insertions(+), 268 deletions(-) -- 2.25.1