On Thu, Dec 30, 2010 at 04:24:10PM +0800, Sheng Yang wrote: > On Thursday 30 December 2010 16:15:32 Michael S. Tsirkin wrote: > > On Thu, Dec 30, 2010 at 03:55:10PM +0800, Sheng Yang wrote: > > > On Thursday 30 December 2010 15:47:48 Michael S. Tsirkin wrote: > > > > On Thu, Dec 30, 2010 at 03:32:42PM +0800, Sheng Yang wrote: > > > > > On Wednesday 29 December 2010 17:28:24 Michael S. Tsirkin wrote: > > > > > > On Wed, Dec 29, 2010 at 04:55:19PM +0800, Sheng Yang wrote: > > > > > > > On Wednesday 29 December 2010 16:31:35 Michael S. Tsirkin wrote: > > > > > > > > On Wed, Dec 29, 2010 at 03:18:13PM +0800, Sheng Yang wrote: > > > > > > > > > On Tuesday 28 December 2010 20:26:13 Avi Kivity wrote: > > > > > > > > > > On 12/22/2010 10:44 AM, Sheng Yang wrote: > > > > > > > > > > > Then we can support mask bit operation of assigned > > > > > > > > > > > devices now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > @@ -3817,14 +3819,16 @@ static int > > > > > > > > > > > emulator_write_emulated_onepage(unsigned long addr, > > > > > > > > > > > > > > > > > > > > > > mmio: > > > > > > > > > > > trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 > > > > > > > > > > > *)val); > > > > > > > > > > > > > > > > > > > > > > + r = vcpu_mmio_write(vcpu, gpa, bytes, val); > > > > > > > > > > > > > > > > > > > > > > /* > > > > > > > > > > > > > > > > > > > > > > * Is this MMIO handled locally? > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > > > > > > - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) > > > > > > > > > > > + if (!r) > > > > > > > > > > > > > > > > > > > > > > return X86EMUL_CONTINUE; > > > > > > > > > > > > > > > > > > > > > > vcpu->mmio_needed = 1; > > > > > > > > > > > > > > > > > > > > > > - vcpu->run->exit_reason = KVM_EXIT_MMIO; > > > > > > > > > > > + vcpu->run->exit_reason = (r == -ENOTSYNC) ? > > > > > > > > > > > + KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO; > > > > > > > > > > > > > > > > > > > > This isn't very pretty, exit_reason should be written in > > > > > > > > > > vcpu_mmio_write(). I guess we can refactor it later. > > > > > > > > > > > > > > > > > > Sure. > > > > > > > > > > > > > > > > > > > > +#define KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV (1<< 0) > > > > > > > > > > > + > > > > > > > > > > > +#define KVM_MSIX_MMIO_TYPE_BASE_TABLE (1<< 8) > > > > > > > > > > > +#define KVM_MSIX_MMIO_TYPE_BASE_PBA (1<< 9) > > > > > > > > > > > + > > > > > > > > > > > +#define KVM_MSIX_MMIO_TYPE_DEV_MASK 0x00ff > > > > > > > > > > > +#define KVM_MSIX_MMIO_TYPE_BASE_MASK 0xff00 > > > > > > > > > > > > > > > > > > > > Any explanation of these? > > > > > > > > > > > > > > > > > > I chose to use assigned device id instead of one specific > > > > > > > > > table id, because every device should got at most one MSI > > > > > > > > > MMIO(the same should applied to vfio device as well), and if > > > > > > > > > we use specific table ID, we need way to associate with the > > > > > > > > > device anyway, to perform mask/unmask or other operation. So > > > > > > > > > I think it's better to use device ID here directly. > > > > > > > > > > > > > > > > Table id will be needed to make things work for emulated > > > > > > > > devices. > > > > > > > > > > > > > > I suppose even emulated device should got some kind of id(BDF)? > > > > > > > > > > > > Not that I know. Look at how irqfd is defined for example, > > > > > > or how interrupts are sent through a gsi. > > > > > > I would like to make the interface be able to support that. > > > > > > > > > > > > > I think that is > > > > > > > enough for identification, which is already there, so we don't > > > > > > > need to allocate another ID for the device - because one device > > > > > > > would got at most one MSI-X MMIO, then use BDF or other device > > > > > > > specific ID should be quite straightforward. > > > > > > > > > > > > So you propose allocating ids for emulated devices? > > > > > > OK. How will we map e.g. irqfds to these? > > > > > > > > > > I don't understand. I've checked virtio-pci.c which is using irqfd, > > > > > and it's still a PCI device, and still have BDF, right? > > > > > > > > > > Also, what we want is a way to identify the MSI-X MMIO. For assigned > > > > > device, we use BDF, then we can easily identify the MMIO as well as > > > > > the device. For others, even they don't have BDF(I don't think so, > > > > > because MSI-X is a part of PCI, and every PCI device has BDF), what > > > > > you need is an ID, no matter what it is and how it defined. QEmu can > > > > > get the allocation done, and the type field in this API can still > > > > > tell which kind of ID/devices they are, then determine how to deal > > > > > with them. > > > > > > > > Yes, the PCI device can be identified with e.g. BFD > > > > (won't work for multi-domain but then we can write an allocator maybe). > > > > But how will we inject these interrupts? > > > > We can do this now with GSI ioctl or map GSI to irqfd > > > > and inject with irqfd write. > > > > > > I suppose it's not in the scope of this patch... > > > > This is why I suggested mapping GSI to msix. > > > > > But I think you can still do > > > this, everything is the same as before. QEmu can read from table to get > > > data/address pair, then program the routing table, etc. > > > > Yes, fine, but mask is the problem :) > > When qemu/irqfd injects an interrupt and it's masked, > > guest should not be interrupted. > > I think this should be done by other APIs(to map GSI with MSI-X entry). And of > course you can introduce one GSI type which require one ID and one entry number to > eject later, but it's not in the scope of this patch. What you propose is adding an API to map GSI to a table ID + entry number? Like this? 1. map table ID to address+length 2. map GSI to table ID + entry # But if we have such an API we don't need anything else as there's already an API to map assigned device interrupts to GSIs? > -- > regards > Yang, Sheng > > > > > > -- > > > regards > > > Yang, Sheng > > > > > > > > > > > My idea was this: we have the device id in > > > > > > > > kvm_assigned_msix_entry already. Just put table id and entry > > > > > > > > number in > > > > > > > > kvm_irq_routing_entry (create a new gsi type for this). > > > > > > > > The result will also work for irqfd because these are mapped to > > > > > > > > gsi. > > > > > > > > > > > > > > > > > And for the table and pba address, it's due to the mapping in > > > > > > > > > userspace may know the guest MSI-X table address and PBA > > > > > > > > > address at different time(due to different BAR, refer to the > > > > > > > > > code in assigned_dev_iomem_map() of qemu). So I purposed > > > > > > > > > this API to allow each of them can be passed to kernel space > > > > > > > > > individually. > > > > > > > > > > > > > > > > > > > > +struct kvm_msix_mmio_user { > > > > > > > > > > > + __u32 dev_id; > > > > > > > > > > > + __u16 type; > > > > > > > > > > > + __u16 max_entries_nr; > > > > > > > > > > > + __u64 base_addr; > > > > > > > > > > > + __u64 base_va; > > > > > > > > > > > + __u64 flags; > > > > > > > > > > > + __u64 reserved[4]; > > > > > > > > > > > +}; > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +int kvm_assigned_device_update_msix_mask_bit(struct kvm > > > > > > > > > > > *kvm, + int assigned_dev_id, int entry, u32 > flag) > > > > > > > > > > > +{ > > > > > > > > > > > > > > > > > > > > Need a better name for 'flag' (and make it a bool). > > > > > > > > > > > > > > > > > > > > > + int r = -EFAULT; > > > > > > > > > > > + struct kvm_assigned_dev_kernel *adev; > > > > > > > > > > > + int i; > > > > > > > > > > > + > > > > > > > > > > > + if (!irqchip_in_kernel(kvm)) > > > > > > > > > > > + return r; > > > > > > > > > > > + > > > > > > > > > > > + mutex_lock(&kvm->lock); > > > > > > > > > > > + adev = > > > > > > > > > > > kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, > > > > > > > > > > > + assigned_dev_id); > > > > > > > > > > > + if (!adev) > > > > > > > > > > > + goto out; > > > > > > > > > > > + > > > > > > > > > > > + for (i = 0; i< adev->entries_nr; i++) > > > > > > > > > > > + if (adev->host_msix_entries[i].entry == entry) { > > > > > > > > > > > + if (flag) > > > > > > > > > > > + disable_irq_nosync( > > > > > > > > > > > + adev->host_msix_entries[i].vector); > > > > > > > > > > > + else > > > > > > > > > > > + enable_irq(adev- > >host_msix_entries[i].vector); > > > > > > > > > > > + r = 0; > > > > > > > > > > > + break; > > > > > > > > > > > + } > > > > > > > > > > > +out: > > > > > > > > > > > + mutex_unlock(&kvm->lock); > > > > > > > > > > > + return r; > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > > > > > > @@ -1988,6 +2008,12 @@ static int > > > > > > > > > > > kvm_dev_ioctl_create_vm(void) > > > > > > > > > > > > > > > > > > > > > > return r; > > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > #endif > > > > > > > > > > > > > > > > > > > > > > + r = kvm_register_msix_mmio_dev(kvm); > > > > > > > > > > > + if (r< 0) { > > > > > > > > > > > + kvm_put_kvm(kvm); > > > > > > > > > > > + return r; > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > > > Shouldn't this be part of individual KVM_REGISTER_MSIX_MMIO > > > > > > > > > > calls? > > > > > > > > > > > > > > > > > > In fact this MMIO device is more like global one for the VM, > > > > > > > > > not for every devices. It should handle all MMIO from all > > > > > > > > > MSI-X enabled devices, so I put it in the VM init/destroy > > > > > > > > > process. > > > > > > > > > > > > > > > > > > > > +static int msix_table_mmio_read(struct kvm_io_device > > > > > > > > > > > *this, gpa_t addr, int len, + void *val) > > > > > > > > > > > +{ > > > > > > > > > > > + struct kvm_msix_mmio_dev *mmio_dev = > > > > > > > > > > > + container_of(this, struct kvm_msix_mmio_dev, > > > > > > > > > > > table_dev); + struct kvm_msix_mmio *mmio; > > > > > > > > > > > + int idx, ret = 0, entry, offset, r; > > > > > > > > > > > + > > > > > > > > > > > + mutex_lock(&mmio_dev->lock); > > > > > > > > > > > + idx = get_mmio_table_index(mmio_dev, addr, len); > > > > > > > > > > > + if (idx< 0) { > > > > > > > > > > > + ret = -EOPNOTSUPP; > > > > > > > > > > > + goto out; > > > > > > > > > > > + } > > > > > > > > > > > + if ((addr& 0x3) || (len != 4&& len != 8)) > > > > > > > > > > > + goto out; > > > > > > > > > > > > > > > > > > > > What about (addr & 4) && (len == 8)? Is it supported? It > > > > > > > > > > may cross entry boundaries. > > > > > > > > > > > > > > > > > > Should not supported. But I haven't found words on the PCI > > > > > > > > > spec for it. So I didn't add this check. > > > > > > > > > > > > > > > > IMPLEMENTATION NOTE > > > > > > > > MSI-X Memory Space Structures in Read/Write Memory > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > For all accesses to MSI-X Table and MSI-X PBA fields, software > > > > > > > > must use aligned full > > > > > > > > DWORD or aligned full QWORD transactions; otherwise, the result > > > > > > > > is undefined. > > > > > > > > > > > > > > Yes, this one is enough, I would add the checking. > > > > > > > > > > > > > > > > > > + mmio =&mmio_dev->mmio[idx]; > > > > > > > > > > > + > > > > > > > > > > > + entry = (addr - mmio->table_base_addr) / > > > > > > > > > > > PCI_MSIX_ENTRY_SIZE; + offset = addr& 0xf; > > > > > > > > > > > + r = copy_from_user(val, (void *)(mmio->table_base_va + > > > > > > > > > > > + entry * PCI_MSIX_ENTRY_SIZE + offset), len); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (r) > > > > > > > > > > > + goto out; > > > > > > > > > > > +out: > > > > > > > > > > > + mutex_unlock(&mmio_dev->lock); > > > > > > > > > > > + return ret; > > > > > > > > > > > +} > > > > > > > > > > > + > > > > > > > > > > > +static int msix_table_mmio_write(struct kvm_io_device > > > > > > > > > > > *this, gpa_t addr, + int len, const void *val) > > > > > > > > > > > +{ > > > > > > > > > > > + struct kvm_msix_mmio_dev *mmio_dev = > > > > > > > > > > > + container_of(this, struct kvm_msix_mmio_dev, > > > > > > > > > > > table_dev); + struct kvm_msix_mmio *mmio; > > > > > > > > > > > + int idx, entry, offset, ret = 0, r = 0; > > > > > > > > > > > + gpa_t entry_base; > > > > > > > > > > > + u32 old_ctrl, new_ctrl; > > > > > > > > > > > + > > > > > > > > > > > + mutex_lock(&mmio_dev->lock); > > > > > > > > > > > + idx = get_mmio_table_index(mmio_dev, addr, len); > > > > > > > > > > > + if (idx< 0) { > > > > > > > > > > > + ret = -EOPNOTSUPP; > > > > > > > > > > > + goto out; > > > > > > > > > > > + } > > > > > > > > > > > + if ((addr& 0x3) || (len != 4&& len != 8)) > > > > > > > > > > > + goto out; > > > > > > > > > > > + mmio =&mmio_dev->mmio[idx]; > > > > > > > > > > > + entry = (addr - mmio->table_base_addr) / > > > > > > > > > > > PCI_MSIX_ENTRY_SIZE; + entry_base = mmio->table_base_va + > > > > > > > > > > > entry * PCI_MSIX_ENTRY_SIZE; + offset = addr& 0xF; > > > > > > > > > > > + > > > > > > > > > > > + if (copy_from_user(&old_ctrl, > > > > > > > > > > > + entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL, > > > > > > > > > > > + sizeof old_ctrl)) > > > > > > > > > > > + goto out; > > > > > > > > > > > > > > > > > > > > get_user() is easier. > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > + /* No allow writing to other fields when entry is > > > > > > > > > > > unmasked */ + if (!(old_ctrl& > > > > > > > > > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)&& + offset != > > > > > > > > > > > PCI_MSIX_ENTRY_VECTOR_CTRL) > > > > > > > > > > > + goto out; > > > > > > > > > > > + > > > > > > > > > > > + if (copy_to_user(entry_base + offset, val, len)) > > > > > > > > > > > + goto out; > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > + if (copy_from_user(&new_ctrl, > > > > > > > > > > > + entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL, > > > > > > > > > > > + sizeof new_ctrl)) > > > > > > > > > > > + goto out; > > > > > > > > > > > > > > > > > > > > put_user() > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > + if ((offset< PCI_MSIX_ENTRY_VECTOR_CTRL&& len == 4) > > > > > > > > > > > || + (offset< PCI_MSIX_ENTRY_DATA&& len == 8)) > > > > > > > > > > > + ret = -ENOTSYNC; > > > > > > > > > > > + if (old_ctrl == new_ctrl) > > > > > > > > > > > + goto out; > > > > > > > > > > > + if (!(old_ctrl& PCI_MSIX_ENTRY_CTRL_MASKBIT)&& > > > > > > > > > > > + (new_ctrl& PCI_MSIX_ENTRY_CTRL_MASKBIT)) > > > > > > > > > > > + r = update_msix_mask_bit(mmio_dev->kvm, mmio, > entry, > > > > > > > > > > > 1); + else if ((old_ctrl& > > > > > > > > > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)&& + !(new_ctrl& > > > > > > > > > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)) > > > > > > > > > > > + r = update_msix_mask_bit(mmio_dev->kvm, mmio, > entry, > > > > > > > > > > > 0); + if (r || ret) > > > > > > > > > > > + ret = -ENOTSYNC; > > > > > > > > > > > +out: > > > > > > > > > > > + mutex_unlock(&mmio_dev->lock); > > > > > > > > > > > + return ret; > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > > > > blank line... > > > > > > > > > > > > > > > > > > > > > +static const struct kvm_io_device_ops > > > > > > > > > > > msix_mmio_table_ops = { + .read = > > > > > > > > > > > msix_table_mmio_read, > > > > > > > > > > > + .write = msix_table_mmio_write, > > > > > > > > > > > +}; > > > > > > > > > > > + > > > > > > > > > > > ++ > > > > > > > > > > > +int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm, > > > > > > > > > > > + struct kvm_msix_mmio_user *mmio_user) > > > > > > > > > > > +{ > > > > > > > > > > > + struct kvm_msix_mmio_dev *mmio_dev > > > > > > > > > > > =&kvm->msix_mmio_dev; + struct kvm_msix_mmio *mmio = > > > > > > > > > > > NULL; > > > > > > > > > > > + int r = 0, i; > > > > > > > > > > > + > > > > > > > > > > > + mutex_lock(&mmio_dev->lock); > > > > > > > > > > > + for (i = 0; i< mmio_dev->mmio_nr; i++) { > > > > > > > > > > > + if (mmio_dev->mmio[i].dev_id == mmio_user->dev_id&& > > > > > > > > > > > + (mmio_dev->mmio[i].type& > > > > > > KVM_MSIX_MMIO_TYPE_DEV_MASK) > > > > > > > > == > > > > > > > > > > > > > > > > + (mmio_user->type& > KVM_MSIX_MMIO_TYPE_DEV_MASK)) { > > > > > > > > > > > + mmio =&mmio_dev->mmio[i]; > > > > > > > > > > > + if (mmio->max_entries_nr != mmio_user- > > > > > > > >max_entries_nr) > > > > > > > > > { > > > > > > > > > > > > > > > > + r = -EINVAL; > > > > > > > > > > > + goto out; > > > > > > > > > > > + } > > > > > > > > > > > + break; > > > > > > > > > > > + } > > > > > > > > > > > + } > > > > > > > > > > > + if (!mmio) { > > > > > > > > > > > + if (mmio_dev->mmio_nr == KVM_MSIX_MMIO_MAX) { > > > > > > > > > > > + r = -ENOSPC; > > > > > > > > > > > + goto out; > > > > > > > > > > > + } > > > > > > > > > > > + mmio =&mmio_dev->mmio[mmio_dev->mmio_nr]; > > > > > > > > > > > + mmio_dev->mmio_nr++; > > > > > > > > > > > + } > > > > > > > > > > > + mmio->max_entries_nr = mmio_user->max_entries_nr; > > > > > > > > > > > > > > > > > > > > Sanity check to avoid overflow. > > > > > > > > > > > > > > > > > > > > > + mmio->dev_id = mmio_user->dev_id; > > > > > > > > > > > + mmio->flags = mmio_user->flags; > > > > > > > > > > > > > > > > > > > > Check for unsupported bits (all of them at present?) > > > > > > > > > > > > > > > > > > > > > + if ((mmio_user->type& KVM_MSIX_MMIO_TYPE_DEV_MASK) == > > > > > > > > > > > + KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV) > > > > > > > > > > > + mmio->type = KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV; > > > > > > > > > > > + > > > > > > > > > > > + if ((mmio_user->type& KVM_MSIX_MMIO_TYPE_BASE_MASK) == > > > > > > > > > > > + KVM_MSIX_MMIO_TYPE_BASE_TABLE) { > > > > > > > > > > > + mmio->type |= KVM_MSIX_MMIO_TYPE_BASE_TABLE; > > > > > > > > > > > + mmio->table_base_addr = mmio_user->base_addr; > > > > > > > > > > > + mmio->table_base_va = mmio_user->base_va; > > > > > > > > > > > > > > > > > > > > Check for va in kernel space. > > > > > > > > > > > > > > > > > > > > > + } else if ((mmio_user->type& > > > > > > > > > > > KVM_MSIX_MMIO_TYPE_BASE_MASK) == > > > > > > > > > > > + KVM_MSIX_MMIO_TYPE_BASE_PBA) { > > > > > > > > > > > + mmio->type |= KVM_MSIX_MMIO_TYPE_BASE_PBA; > > > > > > > > > > > + mmio->pba_base_addr = mmio_user->base_addr; > > > > > > > > > > > + mmio->pba_base_va = mmio_user->base_va; > > > > > > > > > > > + } > > > > > > > > > > > +out: > > > > > > > > > > > + mutex_unlock(&mmio_dev->lock); > > > > > > > > > > > + return r; > > > > > > > > > > > +} > > > > > > > > > > > + > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > In all, looks reasonable. I'd like to see documentation > > > > > > > > > > for it, and review from the pci people. Alex, mst? > > > > > > > > > > > > > > > > Some general comments: > > > > > > > > PBA isn't supported in this version, which is OK, but let's not > > > > > > > > add a capability until it is, and let's not try to guess what > > > > > > > > the interface will look like. I think keeping PBA in userspace > > > > > > > > will be hard because it needs to be modified from interrupt > > > > > > > > context. Removing the PBA stub will make the interface > > > > > > > > simpler. > > > > > > > > > > > > > > The API only get the PBA address now which should be fine. And we > > > > > > > still have threaded irq and tasklet for accessing the userspace > > > > > > > for interrupt handler... > > > > > > > > > > > > I don't think it's going to work: we are not > > > > > > in the context of the right process. Further > > > > > > I think we should keep the option of > > > > > > reading the PBA status from the device or host kernel open. > > > > > > And generally having an interface > > > > > > for functionality we don't implement is not a good idea: > > > > > > you don't know whether you really can support the interface you > > > > > > promised. > > > > > > > > > > Well, I don't know if we want to read PBA from device directly. To me > > > > > it's not a good idea because the real device has nothing to do with > > > > > the one we show to the guest. At least direct accessing the mask > > > > > bits of real device would be very dangerous. Avi? > > > > > > > > > > -- > > > > > regards > > > > > Yang, Sheng > > > > > > > > I am not really suggesting this. What I say is PBA is unimplemented > > > > let us not commit to an interface yet. > > > > > > > > > > > -- > > > > > > > regards > > > > > > > Yang, Sheng > > > > > > > > > > > > > > > > Would add the API document soon. > > > > > > > > > > > > > > > > > > -- > > > > > > > > > regards > > > > > > > > > Yang, Sheng -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html