Re: [PATCH v2 1/2] virt: pvmemcontrol: control guest physical memory properties

I made a mistake. This is supposed to be v3.

On Mon, Sep 30, 2024 at 6:13 PM Yuanchu Xie <yuanchu@xxxxxxxxxx> wrote:
>
> Pvmemcontrol provides a way for the guest to control its physical memory
> properties, and enables optimizations and security features. For
> example, the guest can tell the host which parts of a hugepage may be
> left unbacked, or that sensitive data must not be swapped out.
>
> Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
> and also some other properties of the host memory mapping that backs
> guest memory.
> This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
> capability is available, the changes in the backing of the memory region
> on the host are automatically reflected into the guest. For example, an
> mmap() or madvise() that affects the region will be made visible
> immediately.
>
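To make the madvise() case concrete, here is a minimal host-side sketch. It is plain userspace C, not part of the patch and not VMM code: it only shows what MADV_DONTNEED does to a private anonymous mapping on Linux. With pvmemcontrol, the VMM would issue the equivalent call on the memory backing the guest range. The function name is made up for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS and madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustration only (hypothetical helper, not from the patch).
 * Returns 1 if a dirtied page reads back as zero after MADV_DONTNEED,
 * 0 if it does not, and -1 if the setup fails.
 */
int dontneed_zeroes_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int zeroed;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, (size_t)page);	/* fault the page in and dirty it */
	assert(p[0] == 0xAB);

	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* The next access faults in a fresh zero-filled page. */
	zeroed = (p[0] == 0 && p[page - 1] == 0);
	munmap(p, (size_t)page);
	return zeroed;
}
```

On a private anonymous mapping the old contents are dropped and the backing page is released, which is the memory saving the DONTNEED function code is after.
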
> There are two components of the implementation: the guest Linux driver
> and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
> buffer is negotiated per-cpu through a few PCI MMIO registers, and the VMM
> device assigns a unique command for each per-cpu buffer. The guest
> writes its pvmemcontrol request in the per-cpu buffer, then writes the
> corresponding command into the command register, calling into the VMM
> device to perform the pvmemcontrol request.
>
> The synchronous per-cpu shared buffer approach avoids the kick and busy
> waiting that the guest would have to do with a virtio virtqueue transport.
>
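The handshake described above can be sketched with a mock device in plain C. This is a rough model under stated assumptions, not the Cloud Hypervisor implementation. The magic values are copied from enum pvmemcontrol_transport_command in the patch. The mock_* names, the channel limit, the choice of 1, 2, 3, ... as per-buffer commands, and the "echo the length" reply are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic values copied from enum pvmemcontrol_transport_command. */
#define XPORT_RESET    0x060FE6D2u
#define XPORT_REGISTER 0x0E359539u
#define XPORT_READY    0x0CA8D227u
#define XPORT_ACK      0x03CF5196u
#define XPORT_ERROR    0x01FBA249u

#define MOCK_MAX_CHANNELS 4	/* arbitrary limit for the mock */

/* Same shape as the patch's union: the reply overwrites the request. */
union mock_buf {
	struct { uint64_t func_code, addr, length, arg; } call;
	struct { uint32_t ret_errno, ret_code; uint64_t ret_value, arg0, arg1; } ret;
};

/* Hypothetical stand-in for the VMM device. */
struct mock_device {
	uint64_t request_reg;	/* offset 0x00: buffer address in, command out */
	uint32_t command_reg;	/* offset 0x08 */
	union mock_buf *bufs[MOCK_MAX_CHANNELS];
	uint32_t nr_bufs;
	int ready;
};

/* Device side: react to a write to the command register. */
void mock_write_command(struct mock_device *dev, uint32_t cmd)
{
	if (cmd == XPORT_RESET) {
		dev->nr_bufs = 0;
		dev->ready = 0;
		dev->command_reg = XPORT_ACK;
	} else if (cmd == XPORT_REGISTER && dev->nr_bufs < MOCK_MAX_CHANNELS) {
		dev->bufs[dev->nr_bufs] =
			(union mock_buf *)(uintptr_t)dev->request_reg;
		/* Assumption: per-buffer commands are simply 1, 2, 3, ... */
		dev->request_reg = ++dev->nr_bufs;
		dev->command_reg = XPORT_ACK;
	} else if (cmd == XPORT_READY) {
		dev->ready = 1;
		dev->command_reg = XPORT_ACK;
	} else if (dev->ready && cmd >= 1 && cmd <= dev->nr_bufs) {
		union mock_buf *buf = dev->bufs[cmd - 1];
		uint64_t len = buf->call.length;

		/* "Handle" the request: echo the length as ret_value. */
		buf->ret.ret_errno = 0;
		buf->ret.ret_code = 0;
		buf->ret.ret_value = len;
	} else {
		dev->command_reg = XPORT_ERROR;
	}
}

/* Guest side: reset, register every buffer, then mark the device ready. */
int mock_connect(struct mock_device *dev, union mock_buf *bufs,
		 uint32_t *cmds, uint32_t n)
{
	uint32_t i;

	mock_write_command(dev, XPORT_RESET);
	if (dev->command_reg != XPORT_ACK)
		return -1;

	for (i = 0; i < n; i++) {
		dev->request_reg = (uint64_t)(uintptr_t)&bufs[i];
		mock_write_command(dev, XPORT_REGISTER);
		if (dev->command_reg != XPORT_ACK)
			return -1;
		cmds[i] = (uint32_t)dev->request_reg;	/* per-buffer command */
	}

	mock_write_command(dev, XPORT_READY);
	return dev->command_reg == XPORT_ACK ? 0 : -1;
}

/* Guest side: one synchronous request on a registered buffer. */
uint64_t mock_request(struct mock_device *dev, union mock_buf *buf,
		      uint32_t cmd, uint64_t func_code, uint64_t addr,
		      uint64_t length)
{
	buf->call.func_code = func_code;
	buf->call.addr = addr;
	buf->call.length = length;
	buf->call.arg = 0;
	mock_write_command(dev, cmd);	/* the reply is in buf on return */
	assert(buf->ret.ret_code == 0);
	return buf->ret.ret_value;
}
```

The point of the model is that the command write itself is the call: by the time it returns, the reply already sits in the same buffer, so there is nothing to poll.
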
> User API
> From userland, the pvmemcontrol guest driver is controlled via an
> ioctl(2) call. It requires CAP_SYS_ADMIN.
>
> ioctl(fd, PVMEMCONTROL_IOCTL_VMM, struct pvmemcontrol_buf *buf);
>
> Guest userland applications can tag VMAs and guest hugepages, or advise
> the host on how to handle sensitive guest pages.
>
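For reference, a minimal userland sketch of the call. The structs and the ioctl number mirror include/uapi/linux/pvmemcontrol.h from this patch, repeated locally so the snippet builds without the header installed. The two helper functions are hypothetical, and mapping a nonzero ret_code to -EIO is my own assumption rather than anything the driver specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local mirror of the uapi definitions in the patch. */
struct pvmemcontrol_vmm_call { uint64_t func_code, addr, length, arg; };
struct pvmemcontrol_vmm_ret {
	uint32_t ret_errno, ret_code;
	uint64_t ret_value, arg0, arg1;
};
struct pvmemcontrol_buf {
	union {
		struct pvmemcontrol_vmm_call call;
		struct pvmemcontrol_vmm_ret ret;
	};
};
#define PVMEMCONTROL_IOCTL_VMM _IOWR(0xDA, 0x00, struct pvmemcontrol_buf)
#define PVMEMCONTROL_DONTNEED 1

/*
 * Hypothetical helper: build a DONTNEED request. addr and length must be
 * aligned to the host page size, which the INFO call reports in ret_value.
 */
int pvmemcontrol_fill_dontneed(struct pvmemcontrol_buf *buf, uint64_t addr,
			       uint64_t length, uint64_t host_page_size)
{
	/* The alignment mask below only works for a power-of-two size. */
	assert(host_page_size && !(host_page_size & (host_page_size - 1)));

	if ((addr | length) & (host_page_size - 1))
		return -EINVAL;

	memset(buf, 0, sizeof(*buf));
	buf->call.func_code = PVMEMCONTROL_DONTNEED;
	buf->call.addr = addr;
	buf->call.length = length;
	return 0;
}

/* Hypothetical helper: issue the request on an open device fd. */
int pvmemcontrol_call(int fd, struct pvmemcontrol_buf *buf)
{
	if (ioctl(fd, PVMEMCONTROL_IOCTL_VMM, buf) < 0)
		return -errno;
	/* Assumption: collapse any VMM-side failure into -EIO. */
	return buf->ret.ret_code ? -EIO : 0;
}
```

A caller would open /dev/pvmemcontrol with CAP_SYS_ADMIN, fill the buffer, and pass the fd to pvmemcontrol_call(). On return the same buffer holds the reply, since call and ret share a union.
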
> Supported function codes and their use cases:
> PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT. The guest can reduce struct
> page and page table lookup overhead by using hugepages that are backed
> by smaller pages on the host. These pvmemcontrol commands allow
> partial freeing of private guest hugepages to save memory. They also
> allow kernel memory, such as kernel stacks and task_structs, to be
> paravirtualized if we expose kernel APIs.
>
> PVMEMCONTROL_MERGEABLE can inform the host KSM to deduplicate VM pages.
>
> PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
> want to share its backing pages.
> The same goes for PVMEMCONTROL_DONTDUMP, so sensitive pages are not
> included in a dump.
> PVMEMCONTROL_MLOCK/MUNLOCK let the guest advise the host that sensitive
> information must not be swapped out on the host.
>
> PVMEMCONTROL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages,
> stack guard pages can be handled in the host and memory can be saved in
> the hugepage.
>
> PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
> how guest memory is being mapped on the host.
>
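As a quick reference, the function codes and the host-side call each one corresponds to can be put in a lookup table. The strings are taken from the comments in include/uapi/linux/pvmemcontrol.h in this patch. The two function names are hypothetical, and a real VMM would dispatch to the actual system calls rather than return strings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: describe the host call behind a function code. */
const char *pvmemcontrol_host_call(uint64_t func_code)
{
	static const char *const calls[] = {
		[0]  = "info (host page size, major/minor version)",
		[1]  = "madvise(addr, len, MADV_DONTNEED)",
		[2]  = "madvise(addr, len, MADV_REMOVE)",
		[3]  = "madvise(addr, len, MADV_FREE)",
		[4]  = "madvise(addr, len, MADV_PAGEOUT)",
		[5]  = "madvise(addr, len, MADV_DONTDUMP)",
		[6]  = "prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, arg)",
		[7]  = "mlock2(addr, len, 0)",
		[8]  = "munlock(addr, len)",
		[9]  = "mprotect(addr, len, PROT_NONE)",
		[10] = "mprotect(addr, len, PROT_READ)",
		[11] = "mprotect(addr, len, PROT_WRITE)",
		[12] = "mprotect(addr, len, PROT_READ | PROT_WRITE)",
		[13] = "madvise(addr, len, MADV_MERGEABLE)",
		[14] = "madvise(addr, len, MADV_UNMERGEABLE)",
	};
	static_assert(sizeof(calls) / sizeof(calls[0]) == 15,
		      "one entry per function code 0..14");

	if (func_code >= sizeof(calls) / sizeof(calls[0]))
		return NULL;	/* unknown function code */
	return calls[func_code];
}

/* Hypothetical helper: is this code one of the madvise-style requests? */
int pvmemcontrol_is_madvise(uint64_t func_code)
{
	const char *call = pvmemcontrol_host_call(func_code);

	return call != NULL && strncmp(call, "madvise(", 8) == 0;
}
```

Seven of the fifteen codes are plain madvise() advice values. The rest map to prctl(), mlock2()/munlock() and mprotect(), plus the INFO query.
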
> Sample program making use of PVMEMCONTROL_DONTNEED:
> https://github.com/Dummyc0m/pvmemcontrol-user
>
> The VMM implementation is part of Cloud Hypervisor. When the
> pvmemcontrol feature is enabled, the VMM can provide the device to a
> supporting guest.
> https://github.com/cloud-hypervisor/cloud-hypervisor
>
> -
> Changelog
> PATCH v2 -> v3
> - added PVMEMCONTROL_MERGEABLE for memory dedupe.
> - updated link to the upstream Cloud Hypervisor repo, and specify the
>   feature required to enable the device.
> PATCH v1 -> v2
> - fixed byte order sparse warning. ioread/write already does
>   little-endian.
> - add include for linux/percpu.h
> RFC v1 -> PATCH v1
> - renamed memctl to pvmemcontrol
> - defined device endianness as little endian
>
> v1:
> https://lore.kernel.org/linux-mm/20240518072422.771698-1-yuanchu@xxxxxxxxxx/
> v2:
> https://lore.kernel.org/linux-mm/20240612021207.3314369-1-yuanchu@xxxxxxxxxx/
>
> Change-Id: Ib9e4026df815a8ffd8d8b29ce13dd12ce3714e21
>
> Add MADV_MERGEABLE to pvmemcontrol
>
> Align pvmemcontrol comments
>
> This change aligns the pvmemcontrol operation IDs and comments in the pvmemcontrol header file
>
> Signed-off-by: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |   2 +
>  drivers/virt/Kconfig                          |   2 +
>  drivers/virt/Makefile                         |   1 +
>  drivers/virt/pvmemcontrol/Kconfig             |  10 +
>  drivers/virt/pvmemcontrol/Makefile            |   2 +
>  drivers/virt/pvmemcontrol/pvmemcontrol.c      | 459 ++++++++++++++++++
>  include/uapi/linux/pvmemcontrol.h             |  76 +++
>  7 files changed, 552 insertions(+)
>  create mode 100644 drivers/virt/pvmemcontrol/Kconfig
>  create mode 100644 drivers/virt/pvmemcontrol/Makefile
>  create mode 100644 drivers/virt/pvmemcontrol/pvmemcontrol.c
>  create mode 100644 include/uapi/linux/pvmemcontrol.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a141e8e65c5d..34a9954cafc7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -372,6 +372,8 @@ Code  Seq#    Include File                                           Comments
>  0xCD  01     linux/reiserfs_fs.h
>  0xCE  01-02  uapi/linux/cxl_mem.h                                    Compute Express Link Memory Devices
>  0xCF  02     fs/smb/client/cifs_ioctl.h
> +0xDA  00     uapi/linux/pvmemcontrol.h                               Pvmemcontrol Device
> +                                                                     <mailto:yuanchu@xxxxxxxxxx>
>  0xDB  00-0F  drivers/char/mwave/mwavepub.h
>  0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
>                                                                       <mailto:aherrman@xxxxxxxxxx>
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index d8c848cf09a6..454e347a90cf 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -49,4 +49,6 @@ source "drivers/virt/acrn/Kconfig"
>
>  source "drivers/virt/coco/Kconfig"
>
> +source "drivers/virt/pvmemcontrol/Kconfig"
> +
>  endif
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index f29901bd7820..3a1fd6e076ad 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -10,3 +10,4 @@ obj-y                         += vboxguest/
>  obj-$(CONFIG_NITRO_ENCLAVES)   += nitro_enclaves/
>  obj-$(CONFIG_ACRN_HSM)         += acrn/
>  obj-y                          += coco/
> +obj-$(CONFIG_PVMEMCONTROL)     += pvmemcontrol/
> diff --git a/drivers/virt/pvmemcontrol/Kconfig b/drivers/virt/pvmemcontrol/Kconfig
> new file mode 100644
> index 000000000000..9fe16da23bd8
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config PVMEMCONTROL
> +       tristate "pvmemcontrol Guest Service Module"
> +       depends on KVM_GUEST
> +       help
> +         pvmemcontrol is a guest kernel module that lets the guest communicate
> +         with the hypervisor / VMM and control the guest memory backing.
> +
> +         To compile as a module, choose M, the module will be called
> +         pvmemcontrol. If unsure, say N.
> diff --git a/drivers/virt/pvmemcontrol/Makefile b/drivers/virt/pvmemcontrol/Makefile
> new file mode 100644
> index 000000000000..2fc087ef3ef5
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_PVMEMCONTROL)     := pvmemcontrol.o
> diff --git a/drivers/virt/pvmemcontrol/pvmemcontrol.c b/drivers/virt/pvmemcontrol/pvmemcontrol.c
> new file mode 100644
> index 000000000000..f8a07114fad8
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/pvmemcontrol.c
> @@ -0,0 +1,459 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Control guest physical memory properties by sending
> + * madvise-esque requests to the host VMM.
> + *
> + * Author: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> + * Author: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
> + */
> +#include <linux/spinlock.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/percpu.h>
> +#include <linux/types.h>
> +#include <linux/gfp.h>
> +#include <linux/compiler.h>
> +#include <linux/fs.h>
> +#include <linux/sched/clock.h>
> +#include <linux/wait.h>
> +#include <linux/printk.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/proc_fs.h>
> +#include <linux/resource_ext.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/percpu.h>
> +#include <linux/byteorder/generic.h>
> +#include <linux/io-64-nonatomic-lo-hi.h>
> +#include <uapi/linux/pvmemcontrol.h>
> +
> +#define PCI_VENDOR_ID_GOOGLE 0x1ae0
> +#define PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL 0x0087
> +
> +#define PVMEMCONTROL_COMMAND_OFFSET 0x08
> +#define PVMEMCONTROL_REQUEST_OFFSET 0x00
> +#define PVMEMCONTROL_RESPONSE_OFFSET 0x00
> +
> +/*
> + * Magic values that perform the action specified when written to
> + * the command register.
> + */
> +enum pvmemcontrol_transport_command {
> +       PVMEMCONTROL_TRANSPORT_RESET = 0x060FE6D2,
> +       PVMEMCONTROL_TRANSPORT_REGISTER = 0x0E359539,
> +       PVMEMCONTROL_TRANSPORT_READY = 0x0CA8D227,
> +       PVMEMCONTROL_TRANSPORT_DISCONNECT = 0x030F5DA0,
> +       PVMEMCONTROL_TRANSPORT_ACK = 0x03CF5196,
> +       PVMEMCONTROL_TRANSPORT_ERROR = 0x01FBA249,
> +};
> +
> +/* Contains the function code and arguments for specific function */
> +struct pvmemcontrol_vmm_call_le {
> +       __le64 func_code; /* pvmemcontrol set function code */
> +       __le64 addr; /* hyper. page size aligned guest phys. addr */
> +       __le64 length; /* hyper. page size aligned length */
> +       __le64 arg; /* function code specific argument */
> +};
> +
> +/* Is filled on return to guest from VMM from most function calls */
> +struct pvmemcontrol_vmm_ret_le {
> +       __le32 ret_errno; /* on error, value of errno */
> +       __le32 ret_code; /* pvmemcontrol internal error code, on success 0 */
> +       __le64 ret_value; /* return value from the function call */
> +       __le64 arg0; /* currently unused */
> +       __le64 arg1; /* currently unused */
> +};
> +
> +struct pvmemcontrol_buf_le {
> +       union {
> +               struct pvmemcontrol_vmm_call_le call;
> +               struct pvmemcontrol_vmm_ret_le ret;
> +       };
> +};
> +
> +struct pvmemcontrol_percpu_channel {
> +       struct pvmemcontrol_buf_le buf;
> +       u64 buf_phys_addr;
> +       u32 command;
> +};
> +
> +struct pvmemcontrol {
> +       void __iomem *base_addr;
> +       struct device *device;
> +       /* cache the info call */
> +       struct pvmemcontrol_vmm_ret pvmemcontrol_vmm_info;
> +       struct pvmemcontrol_percpu_channel __percpu *pcpu_channels;
> +};
> +
> +static DEFINE_RWLOCK(pvmemcontrol_lock);
> +static struct pvmemcontrol *pvmemcontrol __read_mostly;
> +
> +static void pvmemcontrol_write_command(void __iomem *base_addr, u32 command)
> +{
> +       iowrite32(command, base_addr + PVMEMCONTROL_COMMAND_OFFSET);
> +}
> +
> +static u32 pvmemcontrol_read_command(void __iomem *base_addr)
> +{
> +       return ioread32(base_addr + PVMEMCONTROL_COMMAND_OFFSET);
> +}
> +
> +static void pvmemcontrol_write_reg(void __iomem *base_addr, u64 buf_phys_addr)
> +{
> +       iowrite64_lo_hi(buf_phys_addr, base_addr + PVMEMCONTROL_REQUEST_OFFSET);
> +}
> +
> +static u32 pvmemcontrol_read_resp(void __iomem *base_addr)
> +{
> +       return ioread32(base_addr + PVMEMCONTROL_RESPONSE_OFFSET);
> +}
> +
> +static void pvmemcontrol_buf_call_to_le(struct pvmemcontrol_buf_le *le,
> +                                       const struct pvmemcontrol_buf *buf)
> +{
> +       le->call.func_code = cpu_to_le64(buf->call.func_code);
> +       le->call.addr = cpu_to_le64(buf->call.addr);
> +       le->call.length = cpu_to_le64(buf->call.length);
> +       le->call.arg = cpu_to_le64(buf->call.arg);
> +}
> +
> +static void pvmemcontrol_buf_ret_from_le(struct pvmemcontrol_buf *buf,
> +                                        const struct pvmemcontrol_buf_le *le)
> +{
> +       buf->ret.ret_errno = le32_to_cpu(le->ret.ret_errno);
> +       buf->ret.ret_code = le32_to_cpu(le->ret.ret_code);
> +       buf->ret.ret_value = le64_to_cpu(le->ret.ret_value);
> +       buf->ret.arg0 = le64_to_cpu(le->ret.arg0);
> +       buf->ret.arg1 = le64_to_cpu(le->ret.arg1);
> +}
> +
> +static void pvmemcontrol_send_request(struct pvmemcontrol *pvmemcontrol,
> +                                     struct pvmemcontrol_buf *buf)
> +{
> +       struct pvmemcontrol_percpu_channel *channel;
> +
> +       preempt_disable();
> +       channel = this_cpu_ptr(pvmemcontrol->pcpu_channels);
> +
> +       pvmemcontrol_buf_call_to_le(&channel->buf, buf);
> +       pvmemcontrol_write_command(pvmemcontrol->base_addr, channel->command);
> +       pvmemcontrol_buf_ret_from_le(buf, &channel->buf);
> +
> +       preempt_enable();
> +}
> +
> +static int __pvmemcontrol_vmm_call(struct pvmemcontrol_buf *buf)
> +{
> +       int err = 0;
> +
> +       if (!pvmemcontrol)
> +               return -EINVAL;
> +
> +       read_lock(&pvmemcontrol_lock);
> +       if (!pvmemcontrol) {
> +               err = -EINVAL;
> +               goto unlock;
> +       }
> +       if (buf->call.func_code == PVMEMCONTROL_INFO) {
> +               memcpy(&buf->ret, &pvmemcontrol->pvmemcontrol_vmm_info,
> +                      sizeof(buf->ret));
> +               goto unlock;
> +       }
> +
> +       pvmemcontrol_send_request(pvmemcontrol, buf);
> +
> +unlock:
> +       read_unlock(&pvmemcontrol_lock);
> +       return err;
> +}
> +
> +static int pvmemcontrol_init_info(struct pvmemcontrol *dev,
> +                                 struct pvmemcontrol_buf *buf)
> +{
> +       buf->call.func_code = PVMEMCONTROL_INFO;
> +
> +       pvmemcontrol_send_request(dev, buf);
> +       if (buf->ret.ret_code)
> +               return buf->ret.ret_code;
> +
> +       /* Initialize global pvmemcontrol_vmm_info */
> +       memcpy(&dev->pvmemcontrol_vmm_info, &buf->ret,
> +              sizeof(dev->pvmemcontrol_vmm_info));
> +       dev_info(dev->device,
> +                "pvmemcontrol_vmm_info.ret_errno = %u\n"
> +                "pvmemcontrol_vmm_info.ret_code = %u\n"
> +                "pvmemcontrol_vmm_info.major_version = %llu\n"
> +                "pvmemcontrol_vmm_info.minor_version = %llu\n"
> +                "pvmemcontrol_vmm_info.page_size = %llu\n",
> +                dev->pvmemcontrol_vmm_info.ret_errno,
> +                dev->pvmemcontrol_vmm_info.ret_code,
> +                dev->pvmemcontrol_vmm_info.arg0,
> +                dev->pvmemcontrol_vmm_info.arg1,
> +                dev->pvmemcontrol_vmm_info.ret_value);
> +
> +       return 0;
> +}
> +
> +static int pvmemcontrol_open(struct inode *inode, struct file *filp)
> +{
> +       struct pvmemcontrol_buf *buf = NULL;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       /* Do not allow exclusive open */
> +       if (filp->f_flags & O_EXCL)
> +               return -EINVAL;
> +
> +       buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_KERNEL);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       /* Overwrite the misc device set by misc_register */
> +       filp->private_data = buf;
> +       return 0;
> +}
> +
> +static int pvmemcontrol_release(struct inode *inode, struct file *filp)
> +{
> +       kfree(filp->private_data);
> +       filp->private_data = NULL;
> +       return 0;
> +}
> +
> +static long pvmemcontrol_ioctl(struct file *filp, unsigned int cmd,
> +                              unsigned long ioctl_param)
> +{
> +       struct pvmemcontrol_buf *buf = filp->private_data;
> +       int err;
> +
> +       if (cmd != PVMEMCONTROL_IOCTL_VMM)
> +               return -EINVAL;
> +
> +       if (copy_from_user(&buf->call, (void __user *)ioctl_param,
> +                          sizeof(struct pvmemcontrol_buf)))
> +               return -EFAULT;
> +
> +       err = __pvmemcontrol_vmm_call(buf);
> +       if (err)
> +               return err;
> +
> +       if (copy_to_user((void __user *)ioctl_param, &buf->ret,
> +                        sizeof(struct pvmemcontrol_buf)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +static const struct file_operations pvmemcontrol_fops = {
> +       .owner = THIS_MODULE,
> +       .open = pvmemcontrol_open,
> +       .release = pvmemcontrol_release,
> +       .unlocked_ioctl = pvmemcontrol_ioctl,
> +       .compat_ioctl = compat_ptr_ioctl,
> +};
> +
> +static struct miscdevice pvmemcontrol_dev = {
> +       .minor = MISC_DYNAMIC_MINOR,
> +       .name = KBUILD_MODNAME,
> +       .fops = &pvmemcontrol_fops,
> +};
> +
> +static int pvmemcontrol_connect(struct pvmemcontrol *pvmemcontrol)
> +{
> +       int cpu;
> +       u32 cmd;
> +
> +       pvmemcontrol_write_command(pvmemcontrol->base_addr,
> +                                  PVMEMCONTROL_TRANSPORT_RESET);
> +       cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> +       if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> +               dev_err(pvmemcontrol->device,
> +                       "failed to reset device, cmd 0x%x\n", cmd);
> +               return -EINVAL;
> +       }
> +
> +       for_each_possible_cpu(cpu) {
> +               struct pvmemcontrol_percpu_channel *channel =
> +                       per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu);
> +
> +               pvmemcontrol_write_reg(pvmemcontrol->base_addr,
> +                                      channel->buf_phys_addr);
> +               pvmemcontrol_write_command(pvmemcontrol->base_addr,
> +                                          PVMEMCONTROL_TRANSPORT_REGISTER);
> +
> +               cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> +               if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> +                       dev_err(pvmemcontrol->device,
> +                               "failed to register pcpu buf, cmd 0x%x\n", cmd);
> +                       return -EINVAL;
> +               }
> +               channel->command =
> +                       pvmemcontrol_read_resp(pvmemcontrol->base_addr);
> +       }
> +
> +       pvmemcontrol_write_command(pvmemcontrol->base_addr,
> +                                  PVMEMCONTROL_TRANSPORT_READY);
> +       cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> +       if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> +               dev_err(pvmemcontrol->device,
> +                       "failed to ready device, cmd 0x%x\n", cmd);
> +               return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +static int pvmemcontrol_disconnect(struct pvmemcontrol *pvmemcontrol)
> +{
> +       u32 cmd;
> +
> +       pvmemcontrol_write_command(pvmemcontrol->base_addr,
> +                                  PVMEMCONTROL_TRANSPORT_DISCONNECT);
> +
> +       cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> +       if (cmd != PVMEMCONTROL_TRANSPORT_ERROR) {
> +               dev_err(pvmemcontrol->device,
> +                       "failed to disconnect device, cmd 0x%x\n", cmd);
> +               return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +static int pvmemcontrol_alloc_percpu_channels(struct pvmemcontrol *pvmemcontrol)
> +{
> +       int cpu;
> +
> +       pvmemcontrol->pcpu_channels = alloc_percpu_gfp(
> +               struct pvmemcontrol_percpu_channel, GFP_ATOMIC | __GFP_ZERO);
> +       if (!pvmemcontrol->pcpu_channels)
> +               return -ENOMEM;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct pvmemcontrol_percpu_channel *channel =
> +                       per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu);
> +               phys_addr_t buf_phys = per_cpu_ptr_to_phys(&channel->buf);
> +
> +               channel->buf_phys_addr = buf_phys;
> +       }
> +       return 0;
> +}
> +
> +static int pvmemcontrol_init(struct device *device, void __iomem *base_addr)
> +{
> +       struct pvmemcontrol_buf *buf = NULL;
> +       struct pvmemcontrol *dev = NULL;
> +       int err = 0;
> +
> +       err = misc_register(&pvmemcontrol_dev);
> +       if (err)
> +               return err;
> +
> +       /* We take a spinlock for a long time, but this is only during init. */
> +       write_lock(&pvmemcontrol_lock);
> +       if (READ_ONCE(pvmemcontrol)) {
> +               dev_warn(device, "multiple pvmemcontrol devices present\n");
> +               err = -EEXIST;
> +               goto fail_free;
> +       }
> +
> +       dev = kzalloc(sizeof(struct pvmemcontrol), GFP_ATOMIC);
> +       buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_ATOMIC);
> +       if (!dev || !buf) {
> +               err = -ENOMEM;
> +               goto fail_free;
> +       }
> +
> +       dev->base_addr = base_addr;
> +       dev->device = device;
> +
> +       err = pvmemcontrol_alloc_percpu_channels(dev);
> +       if (err)
> +               goto fail_free;
> +
> +       err = pvmemcontrol_connect(dev);
> +       if (err)
> +               goto fail_free;
> +
> +       err = pvmemcontrol_init_info(dev, buf);
> +       if (err)
> +               goto fail_free;
> +
> +       WRITE_ONCE(pvmemcontrol, dev);
> +       write_unlock(&pvmemcontrol_lock);
> +       return 0;
> +
> +fail_free:
> +       write_unlock(&pvmemcontrol_lock);
> +       kfree(dev);
> +       kfree(buf);
> +       misc_deregister(&pvmemcontrol_dev);
> +       return err;
> +}
> +
> +static int pvmemcontrol_pci_probe(struct pci_dev *dev,
> +                                 const struct pci_device_id *id)
> +{
> +       void __iomem *base_addr;
> +       int err;
> +
> +       err = pcim_enable_device(dev);
> +       if (err < 0)
> +               return err;
> +
> +       base_addr = pcim_iomap(dev, 0, 0);
> +       if (!base_addr)
> +               return -ENOMEM;
> +
> +       err = pvmemcontrol_init(&dev->dev, base_addr);
> +       if (err)
> +               pci_disable_device(dev);
> +
> +       return err;
> +}
> +
> +static void pvmemcontrol_pci_remove(struct pci_dev *pci_dev)
> +{
> +       int err;
> +       struct pvmemcontrol *dev;
> +
> +       write_lock(&pvmemcontrol_lock);
> +       dev = READ_ONCE(pvmemcontrol);
> +       if (!dev) {
> +               err = -EINVAL;
> +               dev_err(&pci_dev->dev, "cleanup called when uninitialized\n");
> +               write_unlock(&pvmemcontrol_lock);
> +               return;
> +       }
> +
> +       /* disconnect */
> +       err = pvmemcontrol_disconnect(dev);
> +       if (err)
> +               dev_err(&pci_dev->dev, "device did not ack disconnect\n");
> +       /* free percpu channels */
> +       free_percpu(dev->pcpu_channels);
> +
> +       kfree(dev);
> +       WRITE_ONCE(pvmemcontrol, NULL);
> +       write_unlock(&pvmemcontrol_lock);
> +       misc_deregister(&pvmemcontrol_dev);
> +}
> +
> +static const struct pci_device_id pvmemcontrol_pci_id_tbl[] = {
> +       { PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL) },
> +       { 0 }
> +};
> +MODULE_DEVICE_TABLE(pci, pvmemcontrol_pci_id_tbl);
> +
> +static struct pci_driver pvmemcontrol_pci_driver = {
> +       .name = "pvmemcontrol",
> +       .id_table = pvmemcontrol_pci_id_tbl,
> +       .probe = pvmemcontrol_pci_probe,
> +       .remove = pvmemcontrol_pci_remove,
> +};
> +module_pci_driver(pvmemcontrol_pci_driver);
> +
> +MODULE_AUTHOR("Yuanchu Xie <yuanchu@xxxxxxxxxx>");
> +MODULE_DESCRIPTION("pvmemcontrol Guest Service Module");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/pvmemcontrol.h b/include/uapi/linux/pvmemcontrol.h
> new file mode 100644
> index 000000000000..31b366dee796
> --- /dev/null
> +++ b/include/uapi/linux/pvmemcontrol.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Userspace interface for /dev/pvmemcontrol
> + * pvmemcontrol Guest Memory Service Module
> + *
> + * Copyright (c) 2024, Google LLC.
> + * Yuanchu Xie <yuanchu@xxxxxxxxxx>
> + * Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
> + */
> +
> +#ifndef _UAPI_PVMEMCONTROL_H
> +#define _UAPI_PVMEMCONTROL_H
> +
> +#include <linux/wait.h>
> +#include <linux/types.h>
> +#include <asm/param.h>
> +
> +/* Contains the function code and arguments for specific function */
> +struct pvmemcontrol_vmm_call {
> +       __u64 func_code;        /* pvmemcontrol set function code */
> +       __u64 addr;             /* hyper. page size aligned guest phys. addr */
> +       __u64 length;           /* hyper. page size aligned length */
> +       __u64 arg;              /* function code specific argument */
> +};
> +
> +/* Is filled on return to guest from VMM from most function calls */
> +struct pvmemcontrol_vmm_ret {
> +       __u32 ret_errno;        /* on error, value of errno */
> +       __u32 ret_code;         /* pvmemcontrol internal error code, on success 0 */
> +       __u64 ret_value;        /* return value from the function call */
> +       __u64 arg0;             /* major version for func_code INFO */
> +       __u64 arg1;             /* minor version for func_code INFO */
> +};
> +
> +struct pvmemcontrol_buf {
> +       union {
> +               struct pvmemcontrol_vmm_call call;
> +               struct pvmemcontrol_vmm_ret ret;
> +       };
> +};
> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define PVMEMCONTROL_IOCTL_TYPE                0xDA
> +
> +#define PVMEMCONTROL_IOCTL_VMM _IOWR(PVMEMCONTROL_IOCTL_TYPE, 0x00, struct pvmemcontrol_buf)
> +
> +/*
> + * Returns the host page size in ret_value.
> + * major version in arg0.
> + * minor version in arg1.
> + */
> +#define PVMEMCONTROL_INFO              0
> +
> +/* Pvmemcontrol calls, struct pvmemcontrol_vmm_ret is returned */
> +#define PVMEMCONTROL_DONTNEED          1 /* madvise(addr, len, MADV_DONTNEED); */
> +#define PVMEMCONTROL_REMOVE            2 /* madvise(addr, len, MADV_REMOVE); */
> +#define PVMEMCONTROL_FREE              3 /* madvise(addr, len, MADV_FREE); */
> +#define PVMEMCONTROL_PAGEOUT           4 /* madvise(addr, len, MADV_PAGEOUT); */
> +#define PVMEMCONTROL_DONTDUMP          5 /* madvise(addr, len, MADV_DONTDUMP); */
> +
> +/* prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, arg) */
> +#define PVMEMCONTROL_SET_VMA_ANON_NAME  6
> +
> +#define PVMEMCONTROL_MLOCK             7 /* mlock2(addr, len, 0) */
> +#define PVMEMCONTROL_MUNLOCK           8 /* munlock(addr, len) */
> +
> +#define PVMEMCONTROL_MPROTECT_NONE     9 /* mprotect(addr, len, PROT_NONE) */
> +#define PVMEMCONTROL_MPROTECT_R               10 /* mprotect(addr, len, PROT_READ) */
> +#define PVMEMCONTROL_MPROTECT_W               11 /* mprotect(addr, len, PROT_WRITE) */
> +/* mprotect(addr, len, PROT_READ | PROT_WRITE) */
> +#define PVMEMCONTROL_MPROTECT_RW       12
> +
> +#define PVMEMCONTROL_MERGEABLE         13 /* madvise(addr, len, MADV_MERGEABLE); */
> +#define PVMEMCONTROL_UNMERGEABLE       14 /* madvise(addr, len, MADV_UNMERGEABLE); */
> +
> +#endif /* _UAPI_PVMEMCONTROL_H */
> --
> 2.46.1.824.gd892dcdcdd-goog
>




