> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> Sent: Friday, August 18, 2023 3:32 AM
> To: linux-hyperv@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-arch@xxxxxxxxxxxxxxx
> Cc: patches@xxxxxxxxxxxxxxx; Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; Dexuan Cui <decui@xxxxxxxxxxxxx>; apais@xxxxxxxxxxxxxxxxxxx; Tianyu Lan <Tianyu.Lan@xxxxxxxxxxxxx>; ssengar@xxxxxxxxxxxxxxxxxxx; MUKESH RATHOR <mukeshrathor@xxxxxxxxxxxxx>; stanislav.kinsburskiy@xxxxxxxxx; jinankjain@xxxxxxxxxxxxxxxxxxx; vkuznets <vkuznets@xxxxxxxxxx>; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; bp@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; hpa@xxxxxxxxx; will@xxxxxxxxxx; catalin.marinas@xxxxxxx
> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to VMMs running on Hyper-V
>
> Add mshv, mshv_root, and mshv_vtl modules:
>
> Module mshv is the parent module to the other two. It provides /dev/mshv, plus some common hypercall helper code. When one of the child modules is loaded, it is registered with the mshv module, which then provides entry point(s) to the child module via the IOCTLs defined in uapi/linux/mshv.h.
>
> E.g. When the mshv_root module is loaded, it registers itself, and the MSHV_CREATE_PARTITION IOCTL becomes available in /dev/mshv. That is used to get a partition fd managed by mshv_root.
>
> Similarly for mshv_vtl module, there is MSHV_CREATE_VTL, which creates an fd representing the lower vtl, managed by mshv_vtl.
>
> Module mshv_root provides APIs for creating and managing child partitions. It defines abstractions for partitions (vms), vps (vcpus), and other things related to running a guest. It exposes the userspace interfaces for a VMM to manage the guest.
>
> Module mshv_vtl provides VTL (Virtual Trust Level) support for VMMs. In this scenario, the host kernel and VMM run in a higher trust level than the guest, but within the same partition. This provides better isolation and performance.
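
Just to confirm my understanding of the intended flow before getting into the details: with mshv and mshv_root loaded, a VMM would do roughly the following. (Rough sketch only -- I am guessing the argument struct name here; the real definition is whatever uapi/linux/mshv.h provides.)

	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/mshv.h>		/* MSHV_CREATE_PARTITION etc. */

	int main(void)
	{
		/* Struct name/fields assumed; actual ones live in uapi/linux/mshv.h */
		struct mshv_create_partition args = { 0 };
		int mshv_fd, partition_fd;

		mshv_fd = open("/dev/mshv", O_RDWR | O_CLOEXEC);
		if (mshv_fd < 0)
			err(1, "open /dev/mshv");	/* mshv.ko not loaded? */

		/* Dispatched to mshv_root once it has registered its create function */
		partition_fd = ioctl(mshv_fd, MSHV_CREATE_PARTITION, &args);
		if (partition_fd < 0)
			err(1, "MSHV_CREATE_PARTITION");	/* mshv_root.ko not loaded? */

		close(partition_fd);
		close(mshv_fd);
		return 0;
	}

and MSHV_CREATE_VTL similarly returns an fd for the lower VTL once mshv_vtl is loaded. Is that right?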
>
> Signed-off-by: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> ---
>  drivers/hv/Kconfig             |   50 +
>  drivers/hv/Makefile            |   20 +
>  drivers/hv/hv_call.c           |  119 ++
>  drivers/hv/hv_common.c         |    4 +
>  drivers/hv/mshv.h              |  156 +++
>  drivers/hv/mshv_eventfd.c      |  758 ++++++++++++
>  drivers/hv/mshv_eventfd.h      |   80 ++
>  drivers/hv/mshv_main.c         |  208 ++++
>  drivers/hv/mshv_msi.c          |  129 +++
>  drivers/hv/mshv_portid_table.c |   84 ++
>  drivers/hv/mshv_root.h         |  194 ++++
>  drivers/hv/mshv_root_hv_call.c | 1064 +++++++++++++++++
>  drivers/hv/mshv_root_main.c    | 1964 ++++++++++++++++++++++++++++++++
>  drivers/hv/mshv_synic.c        |  689 +++++++++++
>  drivers/hv/mshv_vtl.h          |   52 +
>  drivers/hv/mshv_vtl_main.c     | 1542 +++++++++++++++++++++++++
>  drivers/hv/xfer_to_guest.c     |   28 +
>  include/uapi/linux/mshv.h      |  298 +++++
>  18 files changed, 7439 insertions(+)
>  create mode 100644 drivers/hv/hv_call.c
>  create mode 100644 drivers/hv/mshv.h
>  create mode 100644 drivers/hv/mshv_eventfd.c
>  create mode 100644 drivers/hv/mshv_eventfd.h
>  create mode 100644 drivers/hv/mshv_main.c
>  create mode 100644 drivers/hv/mshv_msi.c
>  create mode 100644 drivers/hv/mshv_portid_table.c
>  create mode 100644 drivers/hv/mshv_root.h
>  create mode 100644 drivers/hv/mshv_root_hv_call.c
>  create mode 100644 drivers/hv/mshv_root_main.c
>  create mode 100644 drivers/hv/mshv_synic.c
>  create mode 100644 drivers/hv/mshv_vtl.h
>  create mode 100644 drivers/hv/mshv_vtl_main.c
>  create mode 100644 drivers/hv/xfer_to_guest.c
>  create mode 100644 include/uapi/linux/mshv.h
>
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 00242107d62e..0d9aefc07b15 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -54,4 +54,54 @@ config HYPERV_BALLOON
>  	help
>  	  Select this option to enable Hyper-V Balloon driver.
>  
> +config MSHV
> +	tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
> +	depends on X86_64 && HYPERV
> +	select EVENTFD
> +	select MSHV_XFER_TO_GUEST_WORK
> +	help
> +	  Select this option to enable core functionality for managing guest
> +	  virtual machines running under the Microsoft Hypervisor.
> +
> +	  The interfaces are provided via a device named /dev/mshv.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_ROOT
> +	tristate "Microsoft Hyper-V root partition APIs driver"
> +	depends on MSHV
> +	help
> +	  Select this option to provide /dev/mshv interfaces specific to
> +	  running as the root partition on Microsoft Hypervisor.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_VTL
> +	tristate "Microsoft Hyper-V VTL driver"
> +	depends on MSHV
> +	select HYPERV_VTL_MODE
> +	select TRANSPARENT_HUGEPAGE

TRANSPARENT_HUGEPAGE can be avoided for now.

> +	help
> +	  Select this option to enable Hyper-V VTL driver.
> +	  Virtual Secure Mode (VSM) is a set of hypervisor capabilities and
> +	  enlightenments offered to host and guest partitions which enables
> +	  the creation and management of new security boundaries within
> +	  operating system software.
> +
> +	  VSM achieves and maintains isolation through Virtual Trust Levels
> +	  (VTLs). Virtual Trust Levels are hierarchical, with higher levels
> +	  being more privileged than lower levels. VTL0 is the least privileged
> +	  level, and currently the only other level supported is VTL2.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_XFER_TO_GUEST_WORK
> +	bool
> +
>  endmenu
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index d76df5c8c2a9..da7aa7542b05 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,10 +2,30 @@
>  obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
>  obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
>  obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
> +obj-$(CONFIG_MSHV)		+= mshv.o
> +obj-$(CONFIG_MSHV_VTL)		+= mshv_vtl.o
> +obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
>  
>  CFLAGS_hv_trace.o = -I$(src)
>  CFLAGS_hv_balloon.o = -I$(src)
>  
> +CFLAGS_mshv_main.o = -DHV_HYPERV_DEFS
> +CFLAGS_hv_call.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_main.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_hv_call.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_synic.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_portid_table.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_eventfd.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_msi.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_vtl_main.o = -DHV_HYPERV_DEFS
> +
> +mshv-y += mshv_main.o
> +mshv_root-y := mshv_root_main.o mshv_synic.o mshv_portid_table.o \
> +	       mshv_eventfd.o mshv_msi.o mshv_root_hv_call.o hv_call.o
> +mshv_vtl-y := mshv_vtl_main.o hv_call.o
> +
> +obj-$(CONFIG_MSHV_XFER_TO_GUEST_WORK) += xfer_to_guest.o
> +
>  hv_vmbus-y := vmbus_drv.o \
>  		 hv.o connection.o channel.o \
>  		 channel_mgmt.o ring_buffer.o hv_trace.o
> diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c
> new file mode 100644
> index 000000000000..4455001d8545
> --- /dev/null
> +++ b/drivers/hv/hv_call.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Hypercall helper functions shared between mshv modules.
> + *
> + * Authors:
> + *   Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +
> +#define HV_GET_REGISTER_BATCH_SIZE \
> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> +#define HV_SET_REGISTER_BATCH_SIZE \
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> +		/ sizeof(struct hv_register_assoc))
> +
> +int hv_call_get_vp_registers(
> +	u32 vp_index,
> +	u64 partition_id,
> +	u16 count,
> +	union hv_input_vtl input_vtl,
> +	struct hv_register_assoc *registers)
> +{
> +	struct hv_input_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count, i;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		for (i = 0; i < rep_count; ++i)
> +			input_page->names[i] = registers[i].name;
> +
> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> +					     0, input_page, output_page);

Is there any possibility that the count value is passed as 0 by mistake? In that case, status will remain uninitialized.
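
If that can happen, it might be worth rejecting it up front, or giving status a defined initial value, so the error path stays well defined. Rough sketch of either option:

	/* At the top of hv_call_get_vp_registers() (and the set variant): */
	if (count == 0)
		return -EINVAL;

	/* ...or keep the loop as-is but initialize the declaration: */
	u64 status = HV_STATUS_SUCCESS;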
> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		for (i = 0; i < completed; ++i)
> +			registers[i].value = output_page[i];
> +
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> +int hv_call_set_vp_registers(
> +	u32 vp_index,
> +	u64 partition_id,
> +	u16 count,
> +	union hv_input_vtl input_vtl,
> +	struct hv_register_assoc *registers)
> +{
> +	struct hv_input_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +		       sizeof(struct hv_register_assoc) * rep_count);
> +
> +		status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> +					     0, input_page, NULL);
> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 13f972e72375..ccd76f30a638 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
>   */
>  static inline bool hv_output_arg_exists(void)
>  {
> +#ifdef CONFIG_MSHV_VTL

Although today both options work together, I am wondering which is more accurate here for the scalability of VTL modules: CONFIG_HYPERV_VTL_MODE or CONFIG_MSHV_VTL.

> +	return true;
> +#else
>  	return hv_root_partition ? true : false;
> +#endif
>  }
>  
>  static void hv_kmsg_dump_unregister(void);
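
On that hv_output_arg_exists() question: if CONFIG_HYPERV_VTL_MODE turns out to be the more accurate condition, using IS_ENABLED() would also avoid the #ifdef entirely. Sketch, assuming that config is the right trigger:

	static inline bool hv_output_arg_exists(void)
	{
		/* VTL mode always uses the per-cpu output page */
		if (IS_ENABLED(CONFIG_HYPERV_VTL_MODE))
			return true;

		return hv_root_partition;
	}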
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> new file mode 100644
> index 000000000000..166480a73f3f
> --- /dev/null
> +++ b/drivers/hv/mshv.h
> @@ -0,0 +1,156 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_H_
> +#define _MSHV_H_
> +
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/semaphore.h>
> +#include <linux/sched.h>
> +#include <linux/srcu.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/mshv.h>
> +
> +/*
> + * Hyper-V hypercalls
> + */
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> +int hv_call_create_partition(
> +	u64 flags,
> +	struct hv_partition_creation_properties creation_properties,
> +	union hv_partition_isolation_properties isolation_properties,
> +	u64 *partition_id);
> +int hv_call_initialize_partition(u64 partition_id);
> +int hv_call_finalize_partition(u64 partition_id);
> +int hv_call_delete_partition(u64 partition_id);
> +int hv_call_map_gpa_pages(
> +	u64 partition_id,
> +	u64 gpa_target,
> +	u64 page_count, u32 flags,
> +	struct page **pages);
> +int hv_call_unmap_gpa_pages(
> +	u64 partition_id,
> +	u64 gpa_target,
> +	u64 page_count, u32 flags);
> +int hv_call_get_vp_registers(
> +	u32 vp_index,
> +	u64 partition_id,
> +	u16 count,
> +	union hv_input_vtl input_vtl,
> +	struct hv_register_assoc *registers);
> +int hv_call_get_gpa_access_states(
> +	u64 partition_id,
> +	u32 count,
> +	u64 gpa_base_pfn,
> +	u64 state_flags,
> +	int *written_total,
> +	union hv_gpa_page_access_state *states);
> +
> +int hv_call_set_vp_registers(
> +	u32 vp_index,
> +	u64 partition_id,
> +	u16 count,
> +	union hv_input_vtl input_vtl,
> +	struct hv_register_assoc *registers);

Nit: Opportunity to fix many of the checkpatch.pl warnings related to line breaks here and in many other places.
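
For example, checkpatch is happier with parameters packed onto as few lines as possible and continuation lines aligned with the opening parenthesis, something like (sketch):

	int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
				     union hv_input_vtl input_vtl,
				     struct hv_register_assoc *registers);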
> +int hv_call_install_intercept(u64 partition_id, u32 access_type,
> +	enum hv_intercept_type intercept_type,
> +	union hv_intercept_parameters intercept_parameter);
> +int hv_call_assert_virtual_interrupt(
> +	u64 partition_id,
> +	u32 vector,
> +	u64 dest_addr,
> +	union hv_interrupt_control control);
> +int hv_call_clear_virtual_interrupt(u64 partition_id);
> +
> +#ifdef HV_SUPPORTS_VP_STATE
> +int hv_call_get_vp_state(
> +	u32 vp_index,
> +	u64 partition_id,
> +	enum hv_get_set_vp_state_type type,
> +	struct hv_vp_state_data_xsave xsave,
> +	/* Choose between pages and ret_output */
> +	u64 page_count,
> +	struct page **pages,
> +	union hv_output_get_vp_state *ret_output);
> +int hv_call_set_vp_state(
> +	u32 vp_index,
> +	u64 partition_id,
> +	enum hv_get_set_vp_state_type type,
> +	struct hv_vp_state_data_xsave xsave,
> +	/* Choose between pages and bytes */
> +	u64 page_count,
> +	struct page **pages,
> +	u32 num_bytes,
> +	u8 *bytes);
> +#endif
> +
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +	struct page **state_page);
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type);
> +int hv_call_get_partition_property(
> +	u64 partition_id,
> +	u64 property_code,
> +	u64 *property_value);
> +int hv_call_set_partition_property(
> +	u64 partition_id, u64 property_code, u64 property_value,
> +	void (*completion_handler)(void * /* data */, u64 * /* status */),
> +	void *completion_data);
> +int hv_call_translate_virtual_address(
> +	u32 vp_index,
> +	u64 partition_id,
> +	u64 flags,
> +	u64 gva,
> +	u64 *gpa,
> +	union hv_translate_gva_result *result);
> +int hv_call_get_vp_cpuid_values(
> +	u32 vp_index,
> +	u64 partition_id,
> +	union hv_get_vp_cpuid_values_flags values_flags,
> +	struct hv_cpuid_leaf_info *info,
> +	union hv_output_get_vp_cpuid_values *result);
> +
> +int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> +			u64 connection_partition_id, struct hv_port_info *port_info,
> +			u8 port_vtl, u8 min_connection_vtl, int node);
> +int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
> +int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> +			 u64 connection_partition_id,
> +			 union hv_connection_id connection_id,
> +			 struct hv_connection_info *connection_info,
> +			 u8 connection_vtl, int node);
> +int hv_call_disconnect_port(u64 connection_partition_id,
> +			    union hv_connection_id connection_id);
> +int hv_call_notify_port_ring_empty(u32 sint_index);
> +#ifdef HV_SUPPORTS_REGISTER_INTERCEPT
> +int hv_call_register_intercept_result(u32 vp_index,
> +				      u64 partition_id,
> +				      enum hv_intercept_type intercept_type,
> +				      union hv_register_intercept_result_parameters *params);
> +#endif
> +int hv_call_signal_event_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u8 sint,
> +				u16 flag_number,
> +				u8 *newly_signaled);
> +int hv_call_post_message_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u32 sint_index,
> +				u8 *message);
> +
> +struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
> +
> +int mshv_xfer_to_guest_mode_handle_work(unsigned long ti_work);
> +
> +typedef long (*mshv_create_func_t)(void __user *user_arg);
> +typedef long (*mshv_check_ext_func_t)(u32 arg);
> +int mshv_setup_vtl_func(const mshv_create_func_t create_vtl,
> +			const mshv_check_ext_func_t check_ext);
> +int mshv_set_create_partition_func(const mshv_create_func_t func);
> +
> +#endif /* _MSHV_H */
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> new file mode 100644
> index 000000000000..ddc64fe3920e
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -0,0 +1,758 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * eventfd support for mshv
> + *
> + * Heavily inspired from KVM implementation of irqfd/ioeventfd. The basic
> + * framework code is taken from the kvm implementation.
> + *
> + * All credits to kvm developers.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/eventfd.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> +void
> +mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> +			       struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
> +	mutex_unlock(&partition->irq_lock);
> +}
> +
> +void
> +mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> +				 struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_del_init_rcu(&mian->link);
> +	mutex_unlock(&partition->irq_lock);
> +	synchronize_rcu();
> +}
> +
> +bool
> +mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
> +{
> +	struct mshv_irq_ack_notifier *mian;
> +	bool acked = false;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
> +				 link) {
> +		if (mian->gsi == gsi) {
> +			mian->irq_acked(mian);
> +			acked = true;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return acked;
> +}
> +
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> +	return type == HV_X64_INTERRUPT_TYPE_EXTINT;
> +}
> +
> +static void
> +irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler;
> +	struct mshv_partition *partition;
> +	struct mshv_kernel_irqfd *irqfd;
> +	int idx;
> +
> +	resampler = container_of(mian,
> +				 struct mshv_kernel_irqfd_resampler, notifier);
> +	partition = resampler->partition;
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +
> +	hlist_for_each_entry_rcu(irqfd, &resampler->irqfds_list, resampler_hnode) {
> +		if (hv_should_clear_interrupt(irqfd->lapic_irq.control.interrupt_type))
> +			hv_call_clear_virtual_interrupt(partition->id);
> +
> +		eventfd_signal(irqfd->resamplefd, 1);
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +}
> +
> +static void
> +irqfd_assert(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, assert);
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +
> +	hv_call_assert_virtual_interrupt(irqfd->partition->id,
> +					 irq->vector, irq->apic_id,
> +					 irq->control);
> +}
> +
> +static void
> +irqfd_inject(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_partition *partition = irqfd->partition;
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +	unsigned int seq;
> +	int idx;
> +
> +	WARN_ON(irqfd->resampler &&
> +		!irq->control.level_triggered);
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	if (irqfd->msi_entry.gsi) {
> +		if (!irqfd->msi_entry.entry_valid) {
> +			pr_warn("Invalid routing info for gsi %u",
> +				irqfd->msi_entry.gsi);
> +			srcu_read_unlock(&partition->irq_srcu, idx);
> +			return;
> +		}
> +
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +	schedule_work(&irqfd->assert);
> +}
> +
> +static void
> +irqfd_resampler_shutdown(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler = irqfd->resampler;
> +	struct mshv_partition *partition = resampler->partition;
> +
> +	mutex_lock(&partition->irqfds.resampler_lock);
> +
> +	hlist_del_rcu(&irqfd->resampler_hnode);
> +	synchronize_srcu(&partition->irq_srcu);
> +
> +	if (hlist_empty(&resampler->irqfds_list)) {
> +		hlist_del(&resampler->hnode);
> +		mshv_unregister_irq_ack_notifier(partition, &resampler->notifier);
> +		kfree(resampler);
> +	}
> +
> +	mutex_unlock(&partition->irqfds.resampler_lock);
> +}
> +
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void
> +irqfd_shutdown(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, shutdown);
> +
> +	/*
> +	 * Synchronize with the wait-queue and unhook ourselves to prevent
> +	 * further events.
> +	 */
> +	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> +	if (irqfd->resampler) {
> +		irqfd_resampler_shutdown(irqfd);
> +		eventfd_ctx_put(irqfd->resamplefd);
> +	}
> +
> +	/*
> +	 * We know no new events will be scheduled at this point, so block
> +	 * until all previously outstanding events have completed
> +	 */
> +	flush_work(&irqfd->assert);
> +
> +	/*
> +	 * It is now safe to release the object's resources
> +	 */
> +	eventfd_ctx_put(irqfd->eventfd);
> +	kfree(irqfd);
> +}
> +
> +/* assumes partition->irqfds.lock is held */
> +static bool
> +irqfd_is_active(struct mshv_kernel_irqfd *irqfd)
> +{
> +	return !hlist_unhashed(&irqfd->hnode);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes partition->irqfds.lock is held
> + */
> +static void
> +irqfd_deactivate(struct mshv_kernel_irqfd *irqfd)
> +{
> +	WARN_ON(!irqfd_is_active(irqfd));
> +
> +	hlist_del(&irqfd->hnode);
> +
> +	queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> +static int
> +irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +	     int sync, void *key)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(wait, struct mshv_kernel_irqfd, wait);
> +	unsigned long flags = (unsigned long)key;
> +	int idx;
> +	unsigned int seq;
> +	struct mshv_partition *partition = irqfd->partition;
> +	int ret = 0;
> +
> +	if (flags & POLLIN) {
> +		u64 cnt;
> +
> +		eventfd_ctx_do_read(irqfd->eventfd, &cnt);
> +		idx = srcu_read_lock(&partition->irq_srcu);
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +
> +		/* An event has been signaled, inject an interrupt */
> +		irqfd_inject(irqfd);
> +		srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +		ret = 1;
> +	}
> +
> +	if (flags & POLLHUP) {
> +		/* The eventfd is closing, detach from Partition */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&partition->irqfds.lock, flags);
> +
> +		/*
> +		 * We must check if someone deactivated the irqfd before
> +		 * we could acquire the irqfds.lock since the item is
> +		 * deactivated from the mshv side before it is unhooked from
> +		 * the wait-queue. If it is already deactivated, we can
> +		 * simply return knowing the other side will cleanup for us.
> +		 * We cannot race against the irqfd going away since the
> +		 * other side is required to acquire wqh->lock, which we hold
> +		 */
> +		if (irqfd_is_active(irqfd))
> +			irqfd_deactivate(irqfd);
> +
> +		spin_unlock_irqrestore(&partition->irqfds.lock, flags);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Must be called under irqfds.lock */
> +static void irqfd_update(struct mshv_partition *partition,
> +			 struct mshv_kernel_irqfd *irqfd)
> +{
> +	write_seqcount_begin(&irqfd->msi_entry_sc);
> +	irqfd->msi_entry = mshv_msi_map_gsi(partition, irqfd->gsi);
> +	mshv_set_msi_irq(&irqfd->msi_entry, &irqfd->lapic_irq);
> +	write_seqcount_end(&irqfd->msi_entry_sc);
> +}
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	hlist_for_each_entry(irqfd, &partition->irqfds.items, hnode)
> +		irqfd_update(partition, irqfd);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +}
> +
> +static void
> +irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> +			poll_table *pt)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(pt, struct mshv_kernel_irqfd, pt);
> +
> +	irqfd->wqh = wqh;
> +	add_wait_queue_priority(wqh, &irqfd->wait);
> +}
> +
> +static int
> +mshv_irqfd_assign(struct mshv_partition *partition,
> +		  struct mshv_irqfd *args)
> +{
> +	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
> +	struct mshv_kernel_irqfd *irqfd, *tmp;
> +	unsigned int events;
> +	struct fd f;
> +	int ret;
> +	int idx;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->partition = partition;
> +	irqfd->gsi = args->gsi;
> +	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> +	INIT_WORK(&irqfd->assert, irqfd_assert);
> +	seqcount_spinlock_init(&irqfd->msi_entry_sc,
> +			       &partition->irqfds.lock);
> +
> +	f = fdget(args->fd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	eventfd = eventfd_ctx_fileget(f.file);
> +	if (IS_ERR(eventfd)) {
> +		ret = PTR_ERR(eventfd);
> +		goto fail;
> +	}
> +
> +	irqfd->eventfd = eventfd;
> +
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE) {
> +		struct mshv_kernel_irqfd_resampler *resampler;
> +
> +		resamplefd = eventfd_ctx_fdget(args->resamplefd);
> +		if (IS_ERR(resamplefd)) {
> +			ret = PTR_ERR(resamplefd);
> +			goto fail;
> +		}
> +
> +		irqfd->resamplefd = resamplefd;
> +
> +		mutex_lock(&partition->irqfds.resampler_lock);
> +
> +		hlist_for_each_entry(resampler,
> +				     &partition->irqfds.resampler_list, hnode) {
> +			if (resampler->notifier.gsi == irqfd->gsi) {
> +				irqfd->resampler = resampler;
> +				break;
> +			}
> +		}
> +
> +		if (!irqfd->resampler) {
> +			resampler = kzalloc(sizeof(*resampler),
> +					    GFP_KERNEL_ACCOUNT);
> +			if (!resampler) {
> +				ret = -ENOMEM;
> +				mutex_unlock(&partition->irqfds.resampler_lock);
> +				goto fail;
> +			}
> +
> +			resampler->partition = partition;
> +			INIT_HLIST_HEAD(&resampler->irqfds_list);
> +			resampler->notifier.gsi = irqfd->gsi;
> +			resampler->notifier.irq_acked = irqfd_resampler_ack;
> +
> +			hlist_add_head(&resampler->hnode, &partition->irqfds.resampler_list);
> +			mshv_register_irq_ack_notifier(partition,
> +						       &resampler->notifier);
> +			irqfd->resampler = resampler;
> +		}
> +
> +		hlist_add_head_rcu(&irqfd->resampler_hnode, &irqfd->resampler->irqfds_list);
> +
> +		mutex_unlock(&partition->irqfds.resampler_lock);
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE &&
> +	    !irqfd->lapic_irq.control.level_triggered) {
> +		/*
> +		 * Resample Fd must be for level triggered interrupt
> +		 * Otherwise return with failure
> +		 */
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	ret = 0;
> +	hlist_for_each_entry(tmp, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd != tmp->eventfd)
> +			continue;
> +		/* This fd is used for another irq already. */
> +		ret = -EBUSY;
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		goto fail;
> +	}
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	irqfd_update(partition, irqfd);
> +	hlist_add_head(&irqfd->hnode, &partition->irqfds.items);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered, and trigger it as if we didn't miss it.
> +	 */
> +	events = vfs_poll(f.file, &irqfd->pt);
> +
> +	if (events & POLLIN)
> +		irqfd_inject(irqfd);
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +	/*
> +	 * do not drop the file until the irqfd is fully initialized, otherwise
> +	 * we might race against the POLLHUP
> +	 */
> +	fdput(f);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->resampler)
> +		irqfd_resampler_shutdown(irqfd);
> +
> +	if (resamplefd && !IS_ERR(resamplefd))
> +		eventfd_ctx_put(resamplefd);
> +
> +	if (eventfd && !IS_ERR(eventfd))
> +		eventfd_ctx_put(eventfd);
> +
> +	fdput(f);
> +
> +out:
> +	kfree(irqfd);
> +	return ret;
> +}
> +
> +/*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int
> +mshv_irqfd_deassign(struct mshv_partition *partition,
> +		    struct mshv_irqfd *args)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +	struct eventfd_ctx *eventfd;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi)
> +			irqfd_deactivate(irqfd);
> +	}
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * so that we guarantee there will not be any more interrupts on this
> +	 * gsi once this deassign function returns.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +	return 0;
> +}
> +
> +int
> +mshv_irqfd(struct mshv_partition *partition, struct mshv_irqfd *args)
> +{
> +	if (args->flags & MSHV_IRQFD_FLAG_DEASSIGN)
> +		return mshv_irqfd_deassign(partition, args);
> +
> +	return mshv_irqfd_assign(partition, args);
> +}
> +
> +/*
> + * This function is called as the mshv VM fd is being released.
> + * Shutdown all irqfds that still remain open
> + */
> +static void
> +mshv_irqfd_release(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode)
> +		irqfd_deactivate(irqfd);
> +
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * since we do not take a mshv_partition* reference.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +}
> +
> +int mshv_irqfd_wq_init(void)
> +{
> +	irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
> +	if (!irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void mshv_irqfd_wq_cleanup(void)
> +{
> +	destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +/*
> + * --------------------------------------------------------------------
> + * ioeventfd: translate a MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a MMIO address with an eventfd for receiving
> + * notification when the memory has been touched.
> + *
> + * TODO: Implement eventfd for PIO as well.
> + * --------------------------------------------------------------------
> + */
> +
> +static void
> +ioeventfd_release(struct kernel_mshv_ioeventfd *p, u64 partition_id)
> +{
> +	if (p->doorbell_id > 0)
> +		mshv_unregister_doorbell(partition_id, p->doorbell_id);
> +	eventfd_ctx_put(p->eventfd);
> +	kfree(p);
> +}
> +
> +/* MMIO writes trigger an event if the addr/val match */
> +static void
> +ioeventfd_mmio_write(int doorbell_id, void *data)
> +{
> +	struct mshv_partition *partition = (struct mshv_partition *)data;
> +	struct kernel_mshv_ioeventfd *p;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(p, &partition->ioeventfds.items, hnode) {
> +		if (p->doorbell_id == doorbell_id) {
> +			eventfd_signal(p->eventfd, 1);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static bool
> +ioeventfd_check_collision(struct mshv_partition *partition,
> +			  struct kernel_mshv_ioeventfd *p)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *_p;
> +
> +	hlist_for_each_entry(_p, &partition->ioeventfds.items, hnode)
> +		if (_p->addr == p->addr && _p->length == p->length &&
> +		    (_p->wildcard || p->wildcard ||
> +		     _p->datamatch == p->datamatch))
> +			return true;
> +
> +	return false;
> +}
> +
> +static int
> +mshv_assign_ioeventfd(struct mshv_partition *partition,
> +		      struct mshv_ioeventfd *args)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *p;
> +	struct eventfd_ctx *eventfd;
> +	u64 doorbell_flags = 0;
> +	int ret;
> +
> +	/* This mutex is currently protecting ioeventfd.items list */
> +	WARN_ON_ONCE(!mutex_is_locked(&partition->mutex));
> +
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_PIO)
> +		return -EOPNOTSUPP;
> +
> +	/* must be natural-word sized */
> +	switch (args->len) {
> +	case 0:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
> +		break;
> +	case 1:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
> +		break;
> +	case 2:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
> +		break;
> +	case 4:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
> +		break;
> +	case 8:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
> +		break;
> +	default:
> +		pr_warn("ioeventfd: invalid length specified\n");
> +		return -EINVAL;
> +	}
> +
> +	/* check for range overflow */
> +	if (args->addr + args->len < args->addr)
> +		return -EINVAL;
> +
> +	/* check for extra flags that we don't understand */
> +	if (args->flags & ~MSHV_IOEVENTFD_VALID_FLAG_MASK)
> +		return -EINVAL;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	p = kzalloc(sizeof(*p), GFP_KERNEL);
> +	if (!p) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	p->addr = args->addr;
> +	p->length = args->len;
> +	p->eventfd = eventfd;
> +
> +	/* The datamatch feature is optional, otherwise this is a wildcard */
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_DATAMATCH)
> +		p->datamatch = args->datamatch;
> +	else {
> +		p->wildcard = true;
> +		doorbell_flags |= HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
> +	}
> +
> +	if (ioeventfd_check_collision(partition, p)) {
> +		ret = -EEXIST;
> +		goto unlock_fail;
> +	}
> +
> +	ret = mshv_register_doorbell(partition->id, ioeventfd_mmio_write,
> +				     (void *)partition, p->addr,
> +				     p->datamatch, doorbell_flags);
> +	if (ret < 0) {
> +		pr_err("Failed to register ioeventfd doorbell!\n");

Nit: Do we want to print the function name at the start of pr_err()?
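That would match the style already used in hv_call.c above, i.e. something like:

	pr_err("%s: Failed to register ioeventfd doorbell!\n", __func__);

- Saurabh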