On 07/31/2012 01:29 AM, Alex Williamson wrote:
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing.  So you
>> could call this unconditionally.
>
> I suppose parenting them is the key.  I'm counting on
> memory_region_size of zero for an uninitialized, g_malloc0()
> MemoryRegion.

That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
the contents by xoring with a private variable.

> Initializing them just to have a parent so we can unconditionally
> remove them here seems like it's just shifting complexity from one
> function to another.  The majority of BARs aren't even implemented, so
> we'd actually be setting up a lot of dummy infrastructure for a
> slightly cleaner unmap function.  I'll keep looking at this, but I'm
> not optimistic there's an overall simplification here.

Ok.  But use your own bool, don't overload something from MemoryRegion.

>> >> > +
>> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > +        if (size) {
>> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> >> > +            munmap(vdev->msix->mmap, size);
>> >> > +        }
>> >> > +    }
>> >
>> > And this one potentially unmaps the overlap after the vector table
>> > if there's any space for one.
>>
>> >> Are the three size checks needed?  Everything should work without
>> >> them from the memory core point of view.
>> >
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>>
>> NULL isn't the problem (well, some kernels protect against mmaping
>> NULL to avoid kernel exploits), but it seems the kernel doesn't like a
>> zero length.
>
> In mm/mmap.c:do_munmap() I see:
>
>     if ((len = PAGE_ALIGN(len)) == 0)
>         return -EINVAL;
>
> before anything scary happens, so that should be ok.  It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests, because the subregions are unparented though.

Yeah.

>> >> > +
>> >> > +    /*
>> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > +     * potentially insert a direct-mapped subregion before and after it.
>> >> > +     */
>> >>
>> >> This splitting is what the memory core really enjoys.  You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> >
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for
>> > security reasons.  It might be worthwhile to someday make VFIO
>> > insert an anonymous page over the MSI-X table to allow this, but it
>> > didn't look trivial for my novice mm abilities.  Easy to add a flag
>> > from the VFIO kernel structure where we learn about this BAR if we
>> > add it in the future.
>>
>> I meant do it purely in qemu.  Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region.  The regions seen by the vfio listener will be the
>> same.
>
> Sure, that's what KVM device assignment does, but it requires being
> able to mmap the whole BAR, including the MSI-X table.  The VFIO
> kernel side can't assume userspace isn't malicious, so it has to
> prevent this.

I wonder whether it should prevent the mmap(), or let it through and
just SIGBUS on accesses.
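If the mmap were allowed, the overlay version would look roughly like
this.  Untested sketch, not even compile-tested; the helper name, field
names, and the priority value 1 are made up for illustration, and the
memory API signatures are from memory:

    /*
     * bar_mem is the BAR's container region as in the patch, msix_mem is
     * the emulated MSI-X table region (assumed already initialized, e.g.
     * by msix_init()), mmap_ptr/bar_size describe the mmap of the whole
     * BAR.
     */
    static void vfio_map_bar_with_overlay(MemoryRegion *bar_mem,
                                          MemoryRegion *mmap_mem,
                                          void *mmap_ptr, uint64_t bar_size,
                                          MemoryRegion *msix_mem,
                                          uint64_t table_offset)
    {
        /* one direct-mapped region covering the whole BAR */
        memory_region_init_ram_ptr(mmap_mem, "vfio-bar-mmap",
                                   bar_size, mmap_ptr);
        memory_region_add_subregion(bar_mem, 0, mmap_mem);

        /* overlay the emulated table at a higher priority; the memory
         * core cuts the underlying region around it when flattening */
        memory_region_add_subregion_overlap(bar_mem, table_offset,
                                            msix_mem, 1);
    }

The vfio listener would then still see the same split regions, without
the device code doing any of the cut-n-paste itself.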
>> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives
>> > us an instance of a container in the kernel.  A group can only be
>> > attached to one container.  So whoever calls us with passed fds
>> > needs to track this very carefully.  This is also why I've dropped
>> > any kind of shared IOMMU option to give us a hint whether to try to
>> > cram everything in the same container (~= iommu domain).  It's too
>> > easy to pass conflicting info to share a container for one device,
>> > but not another... yet they may be in the same group.  I'll work on
>> > the fd passing though and try to come up with a reasonable model.
>>
>> I didn't really follow the container stuff so I can't comment here.
>> But suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu
>> group fd in the management tool)?
>
> Nope.
>
>     containerfd = open(/dev/vfio/vfio)
>     groupfd = open(/dev/vfio/$GROUPID)
>     devicefd = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
>
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and a device cannot be accessed without iommu
> protection.  Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it; that lets us initialize the
> iommu, which allows us to get the device fd.  At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize.  Doing that
> makes adding another device tenuous.  It's not impossible, but VFIO is
> designed such that /dev/vfio/vfio is completely harmless on its own,
> safe for mode 0666 access, just like /dev/kvm.  The groupfd is the
> important access point, so maybe it's sufficient that libvirt could
> pass only that and let qemu open /dev/vfio/vfio on its own.  The only
> problem then is that libvirt needs to pass the same groupfd for each
> device that gets assigned within a group.

What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu.  I don't
see an issue with libvirt keeping track of groups; libvirt is supposed
to be doing the host-side management anyway.

But I'm not familiar with the API so I guess it can't be done.  Maybe an
extension?
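For the archives, my understanding of the full sequence whoever owns the
fds has to go through is roughly the following.  Error handling is
omitted, and the group number 26 and device name 0000:06:0d.0 are made
up placeholders:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int get_vfio_device_fd(void)
    {
        int container, group, device;

        /* harmless on its own, hence safe for 0666, like /dev/kvm */
        container = open("/dev/vfio/vfio", O_RDWR);

        /* the group is the unit of ownership and privilege */
        group = open("/dev/vfio/26", O_RDWR);

        /* attaching the group is what empowers the container... */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

        /* ...so an iommu backend can be chosen and initialized... */
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* ...and only with iommu protection in place do we get a device */
        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        return device;
    }

So passing only the device fd isn't enough: the container has to be
privileged by a group before the iommu, and therefore the device, is
usable at all.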
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > +    struct VFIODevice *vdev; /* back pointer to device */
>> >> > +    int vector; /* the vector number for this element */
>> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
>> >>
>> >> This calls for an abstraction (don't we have a cache where we look
>> >> those up?)
>> >
>> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
>>
>> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
>
> Right, the irqfd is what we're really after.

Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm)
itself.

>> >> > +    bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > +    struct {
>> >> > +        /* enable abstraction to support various iommu backends */
>> >> > +        union {
>> >> > +            MemoryListener listener; /* Used by type1 iommu */
>> >> > +        };
>> >>
>> >> The usual way is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> >
>> > Yep, that would work too.  It gets a bit more complicated that way
>> > though because we have to know, when the container is allocated,
>> > what type it's going to be.  This way we can step through possible
>> > iommu types and support the right one.  Eventually there may be more
>> > than one type supported on the same platform (ex. one that enables
>> > PRI).  Do-able, but I'm not sure it's worth it at this point.
>>
>> Another alternative is to put a pointer to an abstract type here, then
>> you can defer the decision on the concrete type until later.  But I
>> agree it's not worth it at this point.  Maybe just drop the union and
>> decide later when a second iommu type is added.
>
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer.  Alexey's proposed
> POWER support for VFIO already makes use of the union, so it seems like
> a sufficient solution for now.  We'll have to re-evaluate if it's
> getting unwieldy after we get a few more though.

Ok.

--
error compiling committee.c: too many arguments to function