Hi Greg, Are there any other changes that you'd like to see with this driver since your last comments [1]? [1] https://lore.kernel.org/linux-mm/2024051414-untie-deviant-ed35@gregkh/ Thanks, Yuanchu On Mon, Sep 30, 2024 at 6:14 PM Yuanchu Xie <yuanchu@xxxxxxxxxx> wrote: > > I made a mistake. This is supposed to be v3. > > On Mon, Sep 30, 2024 at 6:13 PM Yuanchu Xie <yuanchu@xxxxxxxxxx> wrote: > > > > Pvmemcontrol provides a way for the guest to control its physical memory > > properties, and enables optimizations and security features. For > > example, the guest can inform the host that parts of a hugepage may be > > unbacked, or that sensitive data must not be swapped out, etc. > > > > Pvmemcontrol allows the guest to manipulate its gPTE entries in the SLAT, > > as well as some other properties of the host memory that backs the guest > > memory map. This is achieved by using the KVM_CAP_SYNC_MMU capability. > > When this capability is available, changes in the backing of the memory > > region on the host are automatically reflected into the guest. For > > example, an mmap() or madvise() that affects the region will be made > > visible immediately. > > > > There are two components of the implementation: the guest Linux driver > > and the Virtual Machine Monitor (VMM) device. A guest-allocated shared > > buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM > > device assigns a unique command for each per-cpu buffer. The guest > > writes its pvmemcontrol request into the per-cpu buffer, then writes the > > corresponding command into the command register, calling into the VMM > > device to perform the pvmemcontrol request. > > > > The synchronous per-cpu shared buffer approach avoids the kick and busy > > waiting that the guest would have to do with a virtio virtqueue transport. > > > > User API > > From userland, the pvmemcontrol guest driver is controlled via an > > ioctl(2) call. It requires CAP_SYS_ADMIN.
> > > > ioctl(fd, PVMEMCONTROL_IOCTL_VMM, struct pvmemcontrol_buf *buf); > > > > Guest userland applications can tag VMAs and guest hugepages, or advise > > the host on how to handle sensitive guest pages. > > > > Supported function codes and their use cases: > > PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: for the guest, one can reduce > > the struct page and page table lookup overhead by using hugepages backed > > by smaller pages on the host. These pvmemcontrol commands allow for > > partial freeing of private guest hugepages to save memory. They also > > allow kernel memory, such as kernel stacks and task_structs, to be > > paravirtualized if we expose kernel APIs. > > > > PVMEMCONTROL_MERGEABLE can direct the host's KSM to deduplicate VM pages. > > > > PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not > > want to share its backing pages. > > The same applies to PVMEMCONTROL_DONTDUMP, which keeps sensitive pages > > out of core dumps. > > PVMEMCONTROL_MLOCK/MUNLOCK can advise the host not to swap out sensitive > > information. > > > > PVMEMCONTROL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages, > > stack guard pages can be handled on the host, saving memory in the > > hugepage. > > > > PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging > > how guest memory is being mapped on the host. > > > > Sample program making use of PVMEMCONTROL_DONTNEED: > > https://github.com/Dummyc0m/pvmemcontrol-user > > > > The VMM implementation is part of Cloud Hypervisor; with the > > pvmemcontrol feature enabled, the VMM can provide the device to a > > supporting guest. > > https://github.com/cloud-hypervisor/cloud-hypervisor > > > > - > > Changelog > > PATCH v2 -> v3 > > - added PVMEMCONTROL_MERGEABLE for memory dedupe. > > - updated link to the upstream Cloud Hypervisor repo, and specified the > > feature required to enable the device. > > PATCH v1 -> v2 > > - fixed byte order sparse warning.
ioread/write already does > > little-endian. > > - add include for linux/percpu.h > > RFC v1 -> PATCH v1 > > - renamed memctl to pvmemcontrol > > - defined device endianness as little endian > > > > v1: > > https://lore.kernel.org/linux-mm/20240518072422.771698-1-yuanchu@xxxxxxxxxx/ > > v2: > > https://lore.kernel.org/linux-mm/20240612021207.3314369-1-yuanchu@xxxxxxxxxx/ > > > > Add MADV_MERGEABLE to pvmemcontrol > > > > Align pvmemcontrol comments > > > > This change aligns the pvmemcontrol operation IDs and comments in the pvmemcontrol header file. > > > > Signed-off-by: Yuanchu Xie <yuanchu@xxxxxxxxxx> > > --- > > .../userspace-api/ioctl/ioctl-number.rst | 2 + > > drivers/virt/Kconfig | 2 + > > drivers/virt/Makefile | 1 + > > drivers/virt/pvmemcontrol/Kconfig | 10 + > > drivers/virt/pvmemcontrol/Makefile | 2 + > > drivers/virt/pvmemcontrol/pvmemcontrol.c | 459 ++++++++++++++++++ > > include/uapi/linux/pvmemcontrol.h | 76 +++ > > 7 files changed, 552 insertions(+) > > create mode 100644 drivers/virt/pvmemcontrol/Kconfig > > create mode 100644 drivers/virt/pvmemcontrol/Makefile > > create mode 100644 drivers/virt/pvmemcontrol/pvmemcontrol.c > > create mode 100644 include/uapi/linux/pvmemcontrol.h > > > > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst > > index a141e8e65c5d..34a9954cafc7 100644 > > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst > > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst > > @@ -372,6 +372,8 @@ Code Seq# Include File Comments > > 0xCD 01 linux/reiserfs_fs.h > > 0xCE 01-02 uapi/linux/cxl_mem.h Compute Express Link Memory Devices > > 0xCF 02 fs/smb/client/cifs_ioctl.h > > +0xDA 00 uapi/linux/pvmemcontrol.h Pvmemcontrol Device > > + <mailto:yuanchu@xxxxxxxxxx> > > 0xDB 00-0F drivers/char/mwave/mwavepub.h > > 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/ > > <mailto:aherrman@xxxxxxxxxx> > > 
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig > > index d8c848cf09a6..454e347a90cf 100644 > > --- a/drivers/virt/Kconfig > > +++ b/drivers/virt/Kconfig > > @@ -49,4 +49,6 @@ source "drivers/virt/acrn/Kconfig" > > > > source "drivers/virt/coco/Kconfig" > > > > +source "drivers/virt/pvmemcontrol/Kconfig" > > + > > endif > > diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile > > index f29901bd7820..3a1fd6e076ad 100644 > > --- a/drivers/virt/Makefile > > +++ b/drivers/virt/Makefile > > @@ -10,3 +10,4 @@ obj-y += vboxguest/ > > obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/ > > obj-$(CONFIG_ACRN_HSM) += acrn/ > > obj-y += coco/ > > +obj-$(CONFIG_PVMEMCONTROL) += pvmemcontrol/ > > diff --git a/drivers/virt/pvmemcontrol/Kconfig b/drivers/virt/pvmemcontrol/Kconfig > > new file mode 100644 > > index 000000000000..9fe16da23bd8 > > --- /dev/null > > +++ b/drivers/virt/pvmemcontrol/Kconfig > > @@ -0,0 +1,10 @@ > > +# SPDX-License-Identifier: GPL-2.0 > > +config PVMEMCONTROL > > + tristate "pvmemcontrol Guest Service Module" > > + depends on KVM_GUEST > > + help > > + pvmemcontrol is a guest kernel module that allows the guest to > > + communicate with the hypervisor/VMM and control its memory backing. > > + > > + To compile this driver as a module, choose M here; the module will > > + be called pvmemcontrol. If unsure, say N. 
> > diff --git a/drivers/virt/pvmemcontrol/Makefile b/drivers/virt/pvmemcontrol/Makefile > > new file mode 100644 > > index 000000000000..2fc087ef3ef5 > > --- /dev/null > > +++ b/drivers/virt/pvmemcontrol/Makefile > > @@ -0,0 +1,2 @@ > > +# SPDX-License-Identifier: GPL-2.0 > > +obj-$(CONFIG_PVMEMCONTROL) := pvmemcontrol.o > > diff --git a/drivers/virt/pvmemcontrol/pvmemcontrol.c b/drivers/virt/pvmemcontrol/pvmemcontrol.c > > new file mode 100644 > > index 000000000000..f8a07114fad8 > > --- /dev/null > > +++ b/drivers/virt/pvmemcontrol/pvmemcontrol.c > > @@ -0,0 +1,459 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Control guest physical memory properties by sending > > + * madvise-esque requests to the host VMM. > > + * > > + * Author: Yuanchu Xie <yuanchu@xxxxxxxxxx> > > + * Author: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> > > + */ > > +#include <linux/spinlock.h> > > +#include <linux/cpumask.h> > > +#include <linux/percpu-defs.h> > > +#include <linux/percpu.h> > > +#include <linux/types.h> > > +#include <linux/gfp.h> > > +#include <linux/compiler.h> > > +#include <linux/fs.h> > > +#include <linux/sched/clock.h> > > +#include <linux/wait.h> > > +#include <linux/printk.h> > > +#include <linux/slab.h> > > +#include <linux/miscdevice.h> > > +#include <linux/module.h> > > +#include <linux/proc_fs.h> > > +#include <linux/resource_ext.h> > > +#include <linux/mutex.h> > > +#include <linux/pci.h> > > +#include <linux/percpu.h> > > +#include <linux/byteorder/generic.h> > > +#include <linux/io-64-nonatomic-lo-hi.h> > > +#include <uapi/linux/pvmemcontrol.h> > > + > > +#define PCI_VENDOR_ID_GOOGLE 0x1ae0 > > +#define PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL 0x0087 > > + > > +#define PVMEMCONTROL_COMMAND_OFFSET 0x08 > > +#define PVMEMCONTROL_REQUEST_OFFSET 0x00 > > +#define PVMEMCONTROL_RESPONSE_OFFSET 0x00 > > + > > +/* > > + * Magic values that perform the action specified when written to > > + * the command register. 
> > + */ > > +enum pvmemcontrol_transport_command { > > + PVMEMCONTROL_TRANSPORT_RESET = 0x060FE6D2, > > + PVMEMCONTROL_TRANSPORT_REGISTER = 0x0E359539, > > + PVMEMCONTROL_TRANSPORT_READY = 0x0CA8D227, > > + PVMEMCONTROL_TRANSPORT_DISCONNECT = 0x030F5DA0, > > + PVMEMCONTROL_TRANSPORT_ACK = 0x03CF5196, > > + PVMEMCONTROL_TRANSPORT_ERROR = 0x01FBA249, > > +}; > > + > > +/* Contains the function code and arguments for specific function */ > > +struct pvmemcontrol_vmm_call_le { > > + __le64 func_code; /* pvmemcontrol set function code */ > > + __le64 addr; /* hyper. page size aligned guest phys. addr */ > > + __le64 length; /* hyper. page size aligned length */ > > + __le64 arg; /* function code specific argument */ > > +}; > > + > > +/* Is filled on return to guest from VMM from most function calls */ > > +struct pvmemcontrol_vmm_ret_le { > > + __le32 ret_errno; /* on error, value of errno */ > > + __le32 ret_code; /* pvmemcontrol internal error code, on success 0 */ > > + __le64 ret_value; /* return value from the function call */ > > + __le64 arg0; /* currently unused */ > > + __le64 arg1; /* currently unused */ > > +}; > > + > > +struct pvmemcontrol_buf_le { > > + union { > > + struct pvmemcontrol_vmm_call_le call; > > + struct pvmemcontrol_vmm_ret_le ret; > > + }; > > +}; > > + > > +struct pvmemcontrol_percpu_channel { > > + struct pvmemcontrol_buf_le buf; > > + u64 buf_phys_addr; > > + u32 command; > > +}; > > + > > +struct pvmemcontrol { > > + void __iomem *base_addr; > > + struct device *device; > > + /* cache the info call */ > > + struct pvmemcontrol_vmm_ret pvmemcontrol_vmm_info; > > + struct pvmemcontrol_percpu_channel __percpu *pcpu_channels; > > +}; > > + > > +static DEFINE_RWLOCK(pvmemcontrol_lock); > > +static struct pvmemcontrol *pvmemcontrol __read_mostly; > > + > > +static void pvmemcontrol_write_command(void __iomem *base_addr, u32 command) > > +{ > > + iowrite32(command, base_addr + PVMEMCONTROL_COMMAND_OFFSET); > > +} > > + > > +static u32 
pvmemcontrol_read_command(void __iomem *base_addr) > > +{ > > + return ioread32(base_addr + PVMEMCONTROL_COMMAND_OFFSET); > > +} > > + > > +static void pvmemcontrol_write_reg(void __iomem *base_addr, u64 buf_phys_addr) > > +{ > > + iowrite64_lo_hi(buf_phys_addr, base_addr + PVMEMCONTROL_REQUEST_OFFSET); > > +} > > + > > +static u32 pvmemcontrol_read_resp(void __iomem *base_addr) > > +{ > > + return ioread32(base_addr + PVMEMCONTROL_RESPONSE_OFFSET); > > +} > > + > > +static void pvmemcontrol_buf_call_to_le(struct pvmemcontrol_buf_le *le, > > + const struct pvmemcontrol_buf *buf) > > +{ > > + le->call.func_code = cpu_to_le64(buf->call.func_code); > > + le->call.addr = cpu_to_le64(buf->call.addr); > > + le->call.length = cpu_to_le64(buf->call.length); > > + le->call.arg = cpu_to_le64(buf->call.arg); > > +} > > + > > +static void pvmemcontrol_buf_ret_from_le(struct pvmemcontrol_buf *buf, > > + const struct pvmemcontrol_buf_le *le) > > +{ > > + buf->ret.ret_errno = le32_to_cpu(le->ret.ret_errno); > > + buf->ret.ret_code = le32_to_cpu(le->ret.ret_code); > > + buf->ret.ret_value = le64_to_cpu(le->ret.ret_value); > > + buf->ret.arg0 = le64_to_cpu(le->ret.arg0); > > + buf->ret.arg1 = le64_to_cpu(le->ret.arg1); > > +} > > + > > +static void pvmemcontrol_send_request(struct pvmemcontrol *pvmemcontrol, > > + struct pvmemcontrol_buf *buf) > > +{ > > + struct pvmemcontrol_percpu_channel *channel; > > + > > + preempt_disable(); > > + channel = this_cpu_ptr(pvmemcontrol->pcpu_channels); > > + > > + pvmemcontrol_buf_call_to_le(&channel->buf, buf); > > + pvmemcontrol_write_command(pvmemcontrol->base_addr, channel->command); > > + pvmemcontrol_buf_ret_from_le(buf, &channel->buf); > > + > > + preempt_enable(); > > +} > > + > > +static int __pvmemcontrol_vmm_call(struct pvmemcontrol_buf *buf) > > +{ > > + int err = 0; > > + > > + if (!pvmemcontrol) > > + return -EINVAL; > > + > > + read_lock(&pvmemcontrol_lock); > > + if (!pvmemcontrol) { > > + err = -EINVAL; > > + goto unlock; > > + 
} > > + if (buf->call.func_code == PVMEMCONTROL_INFO) { > > + memcpy(&buf->ret, &pvmemcontrol->pvmemcontrol_vmm_info, > > + sizeof(buf->ret)); > > + goto unlock; > > + } > > + > > + pvmemcontrol_send_request(pvmemcontrol, buf); > > + > > +unlock: > > + read_unlock(&pvmemcontrol_lock); > > + return err; > > +} > > + > > +static int pvmemcontrol_init_info(struct pvmemcontrol *dev, > > + struct pvmemcontrol_buf *buf) > > +{ > > + buf->call.func_code = PVMEMCONTROL_INFO; > > + > > + pvmemcontrol_send_request(dev, buf); > > + if (buf->ret.ret_code) > > + return buf->ret.ret_code; > > + > > + /* Initialize global pvmemcontrol_vmm_info */ > > + memcpy(&dev->pvmemcontrol_vmm_info, &buf->ret, > > + sizeof(dev->pvmemcontrol_vmm_info)); > > + dev_info(dev->device, > > + "pvmemcontrol_vmm_info.ret_errno = %u\n" > > + "pvmemcontrol_vmm_info.ret_code = %u\n" > > + "pvmemcontrol_vmm_info.major_version = %llu\n" > > + "pvmemcontrol_vmm_info.minor_version = %llu\n" > > + "pvmemcontrol_vmm_info.page_size = %llu\n", > > + dev->pvmemcontrol_vmm_info.ret_errno, > > + dev->pvmemcontrol_vmm_info.ret_code, > > + dev->pvmemcontrol_vmm_info.arg0, > > + dev->pvmemcontrol_vmm_info.arg1, > > + dev->pvmemcontrol_vmm_info.ret_value); > > + > > + return 0; > > +} > > + > > +static int pvmemcontrol_open(struct inode *inode, struct file *filp) > > +{ > > + struct pvmemcontrol_buf *buf = NULL; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return -EACCES; > > + > > + /* Do not allow exclusive open */ > > + if (filp->f_flags & O_EXCL) > > + return -EINVAL; > > + > > + buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_KERNEL); > > + if (!buf) > > + return -ENOMEM; > > + > > + /* Overwrite the misc device set by misc_register */ > > + filp->private_data = buf; > > + return 0; > > +} > > + > > +static int pvmemcontrol_release(struct inode *inode, struct file *filp) > > +{ > > + kfree(filp->private_data); > > + filp->private_data = NULL; > > + return 0; > > +} > > + > > +static long 
pvmemcontrol_ioctl(struct file *filp, unsigned int cmd, > > + unsigned long ioctl_param) > > +{ > > + struct pvmemcontrol_buf *buf = filp->private_data; > > + int err; > > + > > + if (cmd != PVMEMCONTROL_IOCTL_VMM) > > + return -EINVAL; > > + > > + if (copy_from_user(&buf->call, (void __user *)ioctl_param, > > + sizeof(struct pvmemcontrol_buf))) > > + return -EFAULT; > > + > > + err = __pvmemcontrol_vmm_call(buf); > > + if (err) > > + return err; > > + > > + if (copy_to_user((void __user *)ioctl_param, &buf->ret, > > + sizeof(struct pvmemcontrol_buf))) > > + return -EFAULT; > > + > > + return 0; > > +} > > + > > +static const struct file_operations pvmemcontrol_fops = { > > + .owner = THIS_MODULE, > > + .open = pvmemcontrol_open, > > + .release = pvmemcontrol_release, > > + .unlocked_ioctl = pvmemcontrol_ioctl, > > + .compat_ioctl = compat_ptr_ioctl, > > +}; > > + > > +static struct miscdevice pvmemcontrol_dev = { > > + .minor = MISC_DYNAMIC_MINOR, > > + .name = KBUILD_MODNAME, > > + .fops = &pvmemcontrol_fops, > > +}; > > + > > +static int pvmemcontrol_connect(struct pvmemcontrol *pvmemcontrol) > > +{ > > + int cpu; > > + u32 cmd; > > + > > + pvmemcontrol_write_command(pvmemcontrol->base_addr, > > + PVMEMCONTROL_TRANSPORT_RESET); > > + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr); > > + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) { > > + dev_err(pvmemcontrol->device, > > + "failed to reset device, cmd 0x%x\n", cmd); > > + return -EINVAL; > > + } > > + > > + for_each_possible_cpu(cpu) { > > + struct pvmemcontrol_percpu_channel *channel = > > + per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu); > > + > > + pvmemcontrol_write_reg(pvmemcontrol->base_addr, > > + channel->buf_phys_addr); > > + pvmemcontrol_write_command(pvmemcontrol->base_addr, > > + PVMEMCONTROL_TRANSPORT_REGISTER); > > + > > + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr); > > + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) { > > + dev_err(pvmemcontrol->device, > > + "failed to register pcpu 
buf, cmd 0x%x\n", cmd); > > + return -EINVAL; > > + } > > + channel->command = > > + pvmemcontrol_read_resp(pvmemcontrol->base_addr); > > + } > > + > > + pvmemcontrol_write_command(pvmemcontrol->base_addr, > > + PVMEMCONTROL_TRANSPORT_READY); > > + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr); > > + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) { > > + dev_err(pvmemcontrol->device, > > + "failed to ready device, cmd 0x%x\n", cmd); > > + return -EINVAL; > > + } > > + return 0; > > +} > > + > > +static int pvmemcontrol_disconnect(struct pvmemcontrol *pvmemcontrol) > > +{ > > + u32 cmd; > > + > > + pvmemcontrol_write_command(pvmemcontrol->base_addr, > > + PVMEMCONTROL_TRANSPORT_DISCONNECT); > > + > > + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr); > > + if (cmd != PVMEMCONTROL_TRANSPORT_ERROR) { > > + dev_err(pvmemcontrol->device, > > + "failed to disconnect device, cmd 0x%x\n", cmd); > > + return -EINVAL; > > + } > > + return 0; > > +} > > + > > +static int pvmemcontrol_alloc_percpu_channels(struct pvmemcontrol *pvmemcontrol) > > +{ > > + int cpu; > > + > > + pvmemcontrol->pcpu_channels = alloc_percpu_gfp( > > + struct pvmemcontrol_percpu_channel, GFP_ATOMIC | __GFP_ZERO); > > + if (!pvmemcontrol->pcpu_channels) > > + return -ENOMEM; > > + > > + for_each_possible_cpu(cpu) { > > + struct pvmemcontrol_percpu_channel *channel = > > + per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu); > > + phys_addr_t buf_phys = per_cpu_ptr_to_phys(&channel->buf); > > + > > + channel->buf_phys_addr = buf_phys; > > + } > > + return 0; > > +} > > + > > +static int pvmemcontrol_init(struct device *device, void __iomem *base_addr) > > +{ > > + struct pvmemcontrol_buf *buf = NULL; > > + struct pvmemcontrol *dev = NULL; > > + int err = 0; > > + > > + err = misc_register(&pvmemcontrol_dev); > > + if (err) > > + return err; > > + > > + /* We take a spinlock for a long time, but this is only during init. 
*/ > > + write_lock(&pvmemcontrol_lock); > > + if (READ_ONCE(pvmemcontrol)) { > > + dev_warn(device, "multiple pvmemcontrol devices present\n"); > > + err = -EEXIST; > > + goto fail_free; > > + } > > + > > + dev = kzalloc(sizeof(struct pvmemcontrol), GFP_ATOMIC); > > + buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_ATOMIC); > > + if (!dev || !buf) { > > + err = -ENOMEM; > > + goto fail_free; > > + } > > + > > + dev->base_addr = base_addr; > > + dev->device = device; > > + > > + err = pvmemcontrol_alloc_percpu_channels(dev); > > + if (err) > > + goto fail_free; > > + > > + err = pvmemcontrol_connect(dev); > > + if (err) > > + goto fail_free; > > + > > + err = pvmemcontrol_init_info(dev, buf); > > + if (err) > > + goto fail_free; > > + > > + WRITE_ONCE(pvmemcontrol, dev); > > + write_unlock(&pvmemcontrol_lock); > > + return 0; > > + > > +fail_free: > > + write_unlock(&pvmemcontrol_lock); > > + kfree(dev); > > + kfree(buf); > > + misc_deregister(&pvmemcontrol_dev); > > + return err; > > +} > > + > > +static int pvmemcontrol_pci_probe(struct pci_dev *dev, > > + const struct pci_device_id *id) > > +{ > > + void __iomem *base_addr; > > + int err; > > + > > + err = pcim_enable_device(dev); > > + if (err < 0) > > + return err; > > + > > + base_addr = pcim_iomap(dev, 0, 0); > > + if (!base_addr) > > + return -ENOMEM; > > + > > + err = pvmemcontrol_init(&dev->dev, base_addr); > > + if (err) > > + pci_disable_device(dev); > > + > > + return err; > > +} > > + > > +static void pvmemcontrol_pci_remove(struct pci_dev *pci_dev) > > +{ > > + int err; > > + struct pvmemcontrol *dev; > > + > > + write_lock(&pvmemcontrol_lock); > > + dev = READ_ONCE(pvmemcontrol); > > + if (!dev) { > > + err = -EINVAL; > > + dev_err(&pci_dev->dev, "cleanup called when uninitialized\n"); > > + write_unlock(&pvmemcontrol_lock); > > + return; > > + } > > + > > + /* disconnect */ > > + err = pvmemcontrol_disconnect(dev); > > + if (err) > > + dev_err(&pci_dev->dev, "device did not ack disconnect\n"); > 
> + /* free percpu channels */ > > + free_percpu(dev->pcpu_channels); > > + > > + kfree(dev); > > + WRITE_ONCE(pvmemcontrol, NULL); > > + write_unlock(&pvmemcontrol_lock); > > + misc_deregister(&pvmemcontrol_dev); > > +} > > + > > +static const struct pci_device_id pvmemcontrol_pci_id_tbl[] = { > > + { PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL) }, > > + { 0 } > > +}; > > +MODULE_DEVICE_TABLE(pci, pvmemcontrol_pci_id_tbl); > > + > > +static struct pci_driver pvmemcontrol_pci_driver = { > > + .name = "pvmemcontrol", > > + .id_table = pvmemcontrol_pci_id_tbl, > > + .probe = pvmemcontrol_pci_probe, > > + .remove = pvmemcontrol_pci_remove, > > +}; > > +module_pci_driver(pvmemcontrol_pci_driver); > > + > > +MODULE_AUTHOR("Yuanchu Xie <yuanchu@xxxxxxxxxx>"); > > +MODULE_DESCRIPTION("pvmemcontrol Guest Service Module"); > > +MODULE_LICENSE("GPL"); > > diff --git a/include/uapi/linux/pvmemcontrol.h b/include/uapi/linux/pvmemcontrol.h > > new file mode 100644 > > index 000000000000..31b366dee796 > > --- /dev/null > > +++ b/include/uapi/linux/pvmemcontrol.h > > @@ -0,0 +1,76 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +/* > > + * Userspace interface for /dev/pvmemcontrol > > + * pvmemcontrol Guest Memory Service Module > > + * > > + * Copyright (c) 2024, Google LLC. > > + * Yuanchu Xie <yuanchu@xxxxxxxxxx> > > + * Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> > > + */ > > + > > +#ifndef _UAPI_PVMEMCONTROL_H > > +#define _UAPI_PVMEMCONTROL_H > > + > > +#include <linux/wait.h> > > +#include <linux/types.h> > > +#include <asm/param.h> > > + > > +/* Contains the function code and arguments for specific function */ > > +struct pvmemcontrol_vmm_call { > > + __u64 func_code; /* pvmemcontrol set function code */ > > + __u64 addr; /* hyper. page size aligned guest phys. addr */ > > + __u64 length; /* hyper. 
page size aligned length */ > > + __u64 arg; /* function code specific argument */ > > +}; > > + > > +/* Is filled on return to guest from VMM from most function calls */ > > +struct pvmemcontrol_vmm_ret { > > + __u32 ret_errno; /* on error, value of errno */ > > + __u32 ret_code; /* pvmemcontrol internal error code, on success 0 */ > > + __u64 ret_value; /* return value from the function call */ > > + __u64 arg0; /* major version for func_code INFO */ > > + __u64 arg1; /* minor version for func_code INFO */ > > +}; > > + > > +struct pvmemcontrol_buf { > > + union { > > + struct pvmemcontrol_vmm_call call; > > + struct pvmemcontrol_vmm_ret ret; > > + }; > > +}; > > + > > +/* The ioctl type, documented in ioctl-number.rst */ > > +#define PVMEMCONTROL_IOCTL_TYPE 0xDA > > + > > +#define PVMEMCONTROL_IOCTL_VMM _IOWR(PVMEMCONTROL_IOCTL_TYPE, 0x00, struct pvmemcontrol_buf) > > + > > +/* > > + * Returns the host page size in ret_value. > > + * major version in arg0. > > + * minor version in arg1. 
> > + */ > > +#define PVMEMCONTROL_INFO 0 > > + > > +/* Pvmemcontrol calls; struct pvmemcontrol_vmm_ret is returned */ > > +#define PVMEMCONTROL_DONTNEED 1 /* madvise(addr, len, MADV_DONTNEED); */ > > +#define PVMEMCONTROL_REMOVE 2 /* madvise(addr, len, MADV_REMOVE); */ > > +#define PVMEMCONTROL_FREE 3 /* madvise(addr, len, MADV_FREE); */ > > +#define PVMEMCONTROL_PAGEOUT 4 /* madvise(addr, len, MADV_PAGEOUT); */ > > +#define PVMEMCONTROL_DONTDUMP 5 /* madvise(addr, len, MADV_DONTDUMP); */ > > + > > +/* prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, arg) */ > > +#define PVMEMCONTROL_SET_VMA_ANON_NAME 6 > > + > > +#define PVMEMCONTROL_MLOCK 7 /* mlock2(addr, len, 0) */ > > +#define PVMEMCONTROL_MUNLOCK 8 /* munlock(addr, len) */ > > + > > +#define PVMEMCONTROL_MPROTECT_NONE 9 /* mprotect(addr, len, PROT_NONE) */ > > +#define PVMEMCONTROL_MPROTECT_R 10 /* mprotect(addr, len, PROT_READ) */ > > +#define PVMEMCONTROL_MPROTECT_W 11 /* mprotect(addr, len, PROT_WRITE) */ > > +/* mprotect(addr, len, PROT_READ | PROT_WRITE) */ > > +#define PVMEMCONTROL_MPROTECT_RW 12 > > + > > +#define PVMEMCONTROL_MERGEABLE 13 /* madvise(addr, len, MADV_MERGEABLE); */ > > +#define PVMEMCONTROL_UNMERGEABLE 14 /* madvise(addr, len, MADV_UNMERGEABLE); */ > > + > > +#endif /* _UAPI_PVMEMCONTROL_H */ > > -- > > 2.46.1.824.gd892dcdcdd-goog > >