Re: Mask bit support's API

"Yang, Sheng" <sheng.yang@xxxxxxxxx> · Fri, 3 Dec 2010 11:03:30 +0800

On Friday 03 December 2010 00:55:03 Michael S. Tsirkin wrote:
> On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote:
> > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote:
> > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:
> > >> >>  Which case?  the readl() doesn't need access to the routing table,
> > >> >>  just the entry.
> > >> >
> > >> >One thing that read should do is flush in the outstanding
> > >> >interrupts and flush out the mask bit writes.
> > >> 
> > >> The mask bit writes are synchronous.
> > >> 
> > >> wrt interrupts, we can deal with assigned devices, and can poll
> > >> irqfds.  But we can't force vhost-net to issue an interrupt (and I
> > >> don't think it's required).
> > > 
> > > To clarify:
> > > 
> > >        mask write
> > >        read
> > > 
> > > it is safe for guest to assume no more interrupts
> > > 
> > > where as with a simple
> > >        mask write
> > > 
> > > an interrupt might be in flight and get delivered shortly afterwards.
> > 
> > I think it's already contained in the current patchset.
> > 
> > >> >>  Oh, I think there is a terminology problem, I was talking about
> > >> >>  kvm's irq routing table, you were talking about the msix entries.
> > >> >> 
> > >> >>  I think treating it as a cache causes more problems, since there
> > >> >> are now two paths for reads (in cache and not in cache) and more
> > >> >> things for writes to manage.
> > >> >> 
> > >> >>  Here's my proposed API:
> > >> >> 
> > >> >>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> > >> >>  pending_bitmap_base_gpa)
> > >> >> 
> > >> >>   - called when the guest enables msix
> > >> >
> > >> >I would add virtual addresses so that we can use swappable memory to
> > >> >store the state.
> > >> 
> > >> Right.
> > 
> > Do we need synchronization between kernel and userspace? Any recommended
> > method?
> > 
> > >> >If we do, maybe we can just keep the table there and then
> > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
> > >> 
> > >> Still need to to let userspace know it needs to reprogram the irqfd
> > >> or whatever it uses to inject the interrupt.
> > > 
> > > Why do we need to reprogram irqfd?  I thought irqfd would map to an
> > > entry within the table instead of address/data as now.
> > > Could you clarify please?
> > > 
> > >> >>  KVM_REMOVE_MSIX_TABLE(table_id)
> > >> >> 
> > >> >>    - called when the guest disables msix
> > >> >> 
> > >> >>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
> > >> >> 
> > >> >>    - called when the guest enables msix (to initialize it), or
> > >> >> after live migration
> > >> >
> > >> >What is entry_id here?
> > >> 
> > >> Entry within the table.
> > > 
> > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
> > > enabled (note: it can not be called at boot anyway since pa
> > > depends on BAR assigned by BIOS).
> > 
> > Don't agree. MMIO can be write regardless of if MSIX is enabled. If
> > you handle MMIO to kernel, them handle them all.
> 
> Hmm, does this mean we would need ioctls to enable/disable MSIX?  I
> think passing control from userspace to kernel when MSIX is enabled is
> not too bad. Will think about it.

Anyway we need ioctls to do so because kernel knows nothing about PCI configuration 
space... And we already have that for assigned device.
> 
> > I suppose qemu still
> > got control of BAR?
> 
> Yes. So e.g. if memory is disabled in the device then we must
> disable this as well, and not claim these transactions
> as mmio.
> 
> > Then leave it in the current place should be fine.
> 
> current place?

assigned_dev_iomem_map().
> 
> At the moment qemu does not notify devices when
> memory is disabled, only when it is enabled.

You mean bit 2 of Command register? I think suppose device would only disable 
memory when shut down should be acceptable. Also we can hook at disable point as 
well.

> So handling msix enable/disable is more straight-forward.

Don't agree. Then you got duplicate between kernel and userspace. Also the 
semantic of MSI-X MMIO has no relationship with MSIX enable/disable.
> 
> > >> >>  Michael?  I think that should work for virtio and vfio assigned
> > >> >>  devices?  Not sure about pending bits.
> > >> >
> > >> >Pending bits must be tracked in kernel, but I don't see
> > >> >how we can support polling mode if we don't exit to userspace
> > >> >on pending bit reads.
> > >> >
> > >> >This does mean that some reads will be fast and some will be
> > >> >slow, and it's a bit sad that we seem to be optimizing
> > >> >for specific guests, but I just can't come up with
> > >> >anything better.
> > >> 
> > >> If the pending bits live in userspace memory, the device model can
> > >> update them directly?
> > > 
> > > Note that these are updated on an interrupt, so updating them
> > > in userspace would need get_user_page etc trickery,
> > > and add the overhead of atomics.
> > > 
> > > Further I think it's important to avoid the overhead of updating them
> > > all the time, and only do this when an interrupt is
> > > masked or on pending bits read. Since userspace does not know
> > > when interrupts are masked, this means do update on each read.
> > 
> > In fact qemu's accessing to MMIO should be quite rare after moving all
> > the things to the kernel. Using IOCTL is also fine with me.
> > 
> > And how to "do update on each read"?
> 
> When read of pending bits is detected, we could forward it up to qemu.
> Qemu could check internal device state and clear bits that are no longer
> relevant.

Don't understand why we need turn to qemu when everything is ready in kernel. And 
pending bit is still in the scope of PCI spec. Also it's can only be written by 
the kernel who emulate the table. Of course, device model can read from it.

--
regards
Yang, Sheng

> 
> > >> >So maybe just add an ioctl to get and to clear pending bits.
> > >> >Maybe set for symmetry.
> > >> 
> > >> For live migration too.  But if they live in memory, no need for
> > >> get/set, just specify the address.
> > >> 
> > >> --
> > >> error compiling committee.c: too many arguments to function
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html