Thanks Alex. On 6/22/2016 4:18 AM, Alex Williamson wrote: > On Mon, 20 Jun 2016 22:01:47 +0530 > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > >> + >> +static int get_mdev_region_info(struct mdev_device *mdev, >> + struct pci_region_info *vfio_region_info, >> + int index) >> +{ >> + int ret = -EINVAL; >> + struct parent_device *parent = mdev->parent; >> + >> + if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) { >> + mutex_lock(&mdev->ops_lock); >> + ret = parent->ops->get_region_info(mdev, index, >> + vfio_region_info); >> + mutex_unlock(&mdev->ops_lock); > > Why do we have two ops_lock, one on the parent_device and one on the > mdev_device?! Is this one actually locking anything or also just > providing serialization? Why do some things get serialized at the > parent level and some things at the device level? Very confused by > ops_lock. > There are two sets of callback: * parent device callbacks: supported_config, create, destroy, start, stop * mdev device callbacks: read, write, set_irqs, get_region_info, validate_map_request parent->ops_lock is to serialize per parent device callbacks. mdev->ops_lock is to serialize per mdev device callbacks. I'll add above comment in mdev.h. >> + >> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type) >> +{ >> + /* Don't support MSIX for now */ >> + if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) >> + return -1; >> + >> + return 1; > > Too much hard coding here, the mediated driver should define this. > I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we thought we can add supported for MSIX later. Till then hard code it to 1. >> + >> + if (parent && parent->ops->set_irqs) { >> + mutex_lock(&mdev->ops_lock); >> + ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index, >> + hdr.start, hdr.count, data); >> + mutex_unlock(&mdev->ops_lock); > > Device level serialization on set_irqs... interesting. > Hope answer above helps to clarify this. >> + } >> + >> + kfree(ptr); >> + return ret; >> + } >> + } >> + return -ENOTTY; >> +} >> + >> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf, >> + size_t count, loff_t *ppos, bool iswrite) >> +{ >> + struct mdev_device *mdev = vmdev->mdev; >> + struct parent_device *parent = mdev->parent; >> + int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size; >> + int ret = 0; >> + uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK; >> + >> + if (pos < 0 || pos >= size || >> + pos + count > size) { >> + pr_err("%s pos 0x%llx out of range\n", __func__, pos); >> + ret = -EFAULT; >> + goto config_rw_exit; >> + } >> + >> + if (iswrite) { >> + char *usr_data, *ptr; >> + >> + ptr = usr_data = memdup_user(buf, count); >> + if (IS_ERR(usr_data)) { >> + ret = PTR_ERR(usr_data); >> + goto config_rw_exit; >> + } >> + >> + ret = parent->ops->write(mdev, usr_data, count, >> + EMUL_CONFIG_SPACE, pos); > > No serialization on this ops, thank goodness, but why? > Its there at caller of mdev_dev_rw(). > This read/write interface still seems strange to me... > Replied on this in 1st Patch. >> + >> + memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count); >> + kfree(ptr); >> + } else { >> + char *ret_data, *ptr; >> + >> + ptr = ret_data = kzalloc(count, GFP_KERNEL); >> + >> + if (IS_ERR(ret_data)) { >> + ret = PTR_ERR(ret_data); >> + goto config_rw_exit; >> + } >> + >> + ret = parent->ops->read(mdev, ret_data, count, >> + EMUL_CONFIG_SPACE, pos); >> + >> + if (ret > 0) { >> + if (copy_to_user(buf, ret_data, ret)) >> + ret = -EFAULT; >> + else >> + memcpy((void *)(vmdev->vconfig + pos), >> + (void *)ret_data, count); >> + } >> + kfree(ptr); > > So vconfig caches all of config space for the mdev, but we only ever > use it to read the BAR address via mdev_read_base()... why? I hope the > mdev driver doesn't freak out if the user reads the mmio region before > writing a base address (remember the vfio API aspect of the interface > doesn't necessarily follow the VM PCI programming API) > How could user read mmio region from guest before writing base address? Isn't that would be a bug? >From our driver if pos is not within the base address range, then we return error for read/write. >> + } >> +config_rw_exit: >> + >> + if (ret > 0) >> + *ppos += ret; >> + >> + return ret; >> +} >> + >> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf, >> + size_t count, loff_t *ppos, bool iswrite) >> +{ >> + struct mdev_device *mdev = vmdev->mdev; >> + struct parent_device *parent = mdev->parent; >> + loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK; >> + loff_t pos; >> + int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); >> + int ret = 0; >> + >> + if (!vmdev->vfio_region_info[bar_index].start) >> + mdev_read_base(vmdev); >> + >> + if (offset >= vmdev->vfio_region_info[bar_index].size) { >> + ret = -EINVAL; >> + goto bar_rw_exit; >> + } >> + >> + count = min(count, >> + (size_t)(vmdev->vfio_region_info[bar_index].size - offset)); >> + >> + pos = vmdev->vfio_region_info[bar_index].start + offset; > > In the case of a mpci dev, @start is the vconfig BAR value, so it's > user (guest) writable, and the mediated driver is supposed to > understand that? I suppose is saw the config write too, if there was > one, but the mediated driver gives us region info based on region index. > We have the region index here. Why wouldn't we do reads and writes > based on region index and offset and eliminate vconfig? Seems like > that would consolidate a lot of this, we don't care what we're reading > and writing, just pass it through. Mediated pci drivers would simply > need to match indexes to those already defined for vfio-pci. > Ok, looking at it. so this will remove vconfig completely. Thanks, Kirti. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html