Re: RFC: Creating mediated devices with libvirt

John Ferlan <jferlan@xxxxxxxxxx> · Thu, 22 Jun 2017 17:57:34 -0400

On 06/14/2017 06:06 PM, Erik Skultety wrote:
> Hi all,
> 
> so there's been an off-list discussion about finally implementing creation of
> mediated devices with libvirt and it's more than desired to get as many opinions
> on that as possible, so please do share your ideas. This did come up already as
> part of some older threads ([1] for example), so this will be a respin of the
> discussions. Long story short, we decided to put device creation off and focus
> on the introduction of the framework as such first and build upon that later,
> i.e. now.
> 
> [1] https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html
> 
> ========================================
> PART 1: NODEDEV-DRIVER
> ========================================
> 
> API-wise, device creation through the nodedev driver should be pretty
> straightforward and without any issues, since virNodeDevCreateXML takes an XML
> and does support flags. Looking at the current device XML:
> 
> <device>
>   <name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name>
>   <path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path>
>   <parent>pci_0000_03_00_0</parent>
>   <driver>
>     <name>vfio_mdev</name>
>   </driver>
>   <capability type='mdev'>
>     <type id='nvidia-11'/>
>     <iommuGroup number='13'/>
>     <uuid>UUID</uuid> <!-- optional enhancement, see below -->
>   </capability>
> </device>
> 
> We can ignore <path>,<driver>,<iommuGroup> elements, since these are useless
> during creation. We also cannot use <name> since we don't support arbitrary
> names and we also can't rely on users providing a name in correct form which we
> would need to further parse in order to get the UUID.
> So since the only thing missing to successfully create an mdev using XML is
> the UUID (if user doesn't want it to be generated automatically), how about
> having a <uuid> subelement under <capability> just like PCIs have <domain> and
> friends, USBs have <bus> & <device>, interfaces have <address> to uniquely
> identify the device even if the name itself is unique.
> Removal of a device should work as well, although we might want to
> consider creating a *Flags version of the API.

Has any thought been put towards creating an mdev pool modeled after the
Storage Pool, similar to how vHBAs are created from a Storage Pool XML
definition?

That way XML could be defined to keep track of a lot of different things
that you may need, and only starting the pool would be required in order
to access them.

Placed "appropriately", the mdevs could already be available by the
time node device state initialization occurs, since the pool would
conceivably have been created/defined using data from the physical
device and the calls to create the virtual devices would already have
occurred. It is much easier to add logic to a new driver/pool mgmt to
handle whatever considerations there are than to add logic to the
existing node device driver.

Of course if there's only ever going to be a 1-to-1 relationship between
whatever the mdev parent is and an mdev child, then it's probably
overkill to go with a pool model; however, I was under the impression
that an mdev parent could have many mdev children with various different
configuration options depending on multiple factors.

Thus:

<gpu_pool type='mdev'>
  <name>Happy</name>
  <uuid>UUID</uuid>
  <source>
    <parent uuid='0cce8709-0640-46ef-bd14-962c7f73cc6f'/>
    ...
  </source>
...
</gpu_pool>

where the parent is then "found" in node device via the name "mdev_%s",
with %s taken from the <parent uuid='...'/> value.
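
Roughly, and only as an illustration of that lookup, the name can be
derived from the UUID the same way the <name> in Erik's example relates
to the UUID at the end of its <path>. The helper name below is made up
for this sketch; it is not an existing libvirt function.

```python
import uuid


def mdev_nodedev_name(dev_uuid: str) -> str:
    """Hypothetical helper: map a UUID to an "mdev_%s" node device name."""
    # uuid.UUID() validates the string and normalizes it to lowercase
    # canonical form; a malformed UUID raises ValueError.
    canonical = str(uuid.UUID(dev_uuid))
    # The node device name uses '_' where the UUID has '-'.
    return "mdev_" + canonical.replace("-", "_")


print(mdev_nodedev_name("0cce8709-0640-46ef-bd14-962c7f73cc6f"))
```

With the UUID from the example above this prints the same string as
the <name> element in the node device XML.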

One could then create (ahem) <vgpu> XML that would define specific
"formats" that could be used and made active/inactive. A bit different
than <volume> XML which is output only based on what's found in the
storage pool source.

My recollection of the whole framework is not up to par with the latest
information, but I recall there being multiple different ways to have
"something" defined that could then be used by the guest based on one
parent mdev. What those things are was a combination of what the mdev
could support, and there could be 1 or many depending on the resultant
vGPU.

Maybe we need a virtual white board to help describe the things ;-)

If you wait long enough, or perhaps if the review pace picks up, maybe
creating a new driver and vir*obj infrastructure will be easier with a
common virObject instance. Oh, and this has a "uuid" and "name" for
searches, so it fits nicely.

> 
> =============================================================
> PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO POLICY INVOLVED!
> =============================================================
> 
> There were some doubts about auto-creation mentioned in [1], although they
> weren't specified further. So hopefully, we'll get further in the discussion
> this time.
> 
> From my perspective there are two main reasons/benefits to that:
> 
> 1) Convenience
> For apps like virt-manager, user will want to add a host device transparently,
> "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for
> higher management apps, like oVirt, even they might not care about the parent
> device at all times and considering that they would need to enumerate the
> parents, pick one, create the device XML and pass it to the nodedev driver, IMHO
> it would actually be easier and faster to just do it directly through sysfs,
> bypassing libvirt once again....

Using "pool" methodology borrows from existing storage technology,
except applied to a "gpu_pool": a pool of vGPUs would be like a storage
pool of volumes. Picking out a volume from a list would seem to be a
mostly simple exercise, especially if the XML/data for the vGPU can be
queried to return something specific: "My domain needs a XXX type vGPU,
please find or create one for me."
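
To make the "find or create" idea a bit more concrete, here is a purely
in-memory sketch. Everything in it (the GpuPool and VGpu classes, the
find_or_create method, the in_use flag) is hypothetical and only models
the bookkeeping; actually creating a device would still have to go
through sysfs or the nodedev driver.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class VGpu:
    uuid: str
    type_id: str           # e.g. 'nvidia-11', as in <type id='nvidia-11'/>
    in_use: bool = False   # assumption: the pool tracks assignment itself


@dataclass
class GpuPool:
    name: str
    supported_types: List[str]
    vgpus: List[VGpu] = field(default_factory=list)

    def find_or_create(self, type_id: str) -> VGpu:
        # First try to reuse an idle vGPU of the requested type.
        for vgpu in self.vgpus:
            if vgpu.type_id == type_id and not vgpu.in_use:
                vgpu.in_use = True
                return vgpu
        # Nothing idle: only create if the parent supports this type.
        if type_id not in self.supported_types:
            raise LookupError(f"pool {self.name!r} cannot provide {type_id!r}")
        vgpu = VGpu(uuid=str(uuid.uuid4()), type_id=type_id, in_use=True)
        self.vgpus.append(vgpu)
        return vgpu


pool = GpuPool(name="Happy", supported_types=["nvidia-11"])
first = pool.find_or_create("nvidia-11")   # created, pool was empty
first.in_use = False                       # the domain released it
again = pool.find_or_create("nvidia-11")   # found, the idle one is reused
print(first is again)
```

The point is just that the caller asks for a type and never has to
enumerate parents or build nodedev XML itself.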

> 
> 2) Future domain migration
> Suppose now that the mdev backing physical devices support state dump and
> reload. Chances are, that the corresponding mdev doesn't even exist or has a
> different UUID on the destination, so libvirt would do its best to handle this
> before the domain could be resumed.
> Following what we already have:
> 
> <devices>
>   <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
>   <source>
>     <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
>   </source>
>   </hostdev>
> </devices>
> 

I guess it's not clear which UUID that is for/from. Is this the one you
were considering in the <capability>? Or, in my terminology, the child of
the parent from above with UUID=0cce8709-0640-46ef-bd14-962c7f73cc6f?

> Instead of trying to somehow extend the <address> element using more
> attributes like 'domain', 'slot', 'function', etc. that would render the whole
> element ambiguous, I was thinking about creating a <parent> element nested under
> <source> that would be basically just a nested definition of another host device
> re-using all the elements we already know, i.e. <address> for PCI, and of course
> others if there happens to be a need for devices other than PCI. So speaking
> about XML, we'd end up with something like:
> 
> <devices>
>   <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
>   <source>
>     <parent>
>       <!-- possibly another <source> element - do we really want that? -->
>         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'/>
>         <type id='foo'/>
>       <!-- end of potential <source> element -->
>     </parent>
>     <!-- this one takes precedence if it exists, ignoring the parent -->
>     <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
>   </source>
>   </hostdev>
> </devices>

Migration makes things a bit more tricky, but bz 1404964 describes some
thoughts Paolo had about vHBA migration. Along those lines, how about a
way to define multiple UUIDs (primary/secondary...), or just a "list" of
<parent uuid='xxx'/> elements from which "the first" found on any given
host is used? By "first found" I assume there's a "physical" card with a
UUID on the host which has a node device with name "mdev_%s" (UUID w/ _
instead of -).
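
A minimal sketch of that "first found" selection, assuming the host's
node device names are already available as a plain set of strings (for
instance collected from "virsh nodedev-list"). The function name and
its arguments are hypothetical, not part of libvirt.

```python
from typing import Iterable, Optional, Set


def first_available_parent(candidates: Iterable[str],
                           host_nodedevs: Set[str]) -> Optional[str]:
    """Return the first candidate UUID that has a node device on this host."""
    for candidate in candidates:
        # "mdev_%s" with '_' in place of '-' in the UUID.
        name = "mdev_" + candidate.lower().replace("-", "_")
        if name in host_nodedevs:
            return candidate
    # None of the listed parents exists on this host.
    return None


# Ordered list as it might appear in the XML: primary first, then secondary.
parents = [
    "c2177883-f1bb-47f0-914d-32a22e3a8804",
    "0cce8709-0640-46ef-bd14-962c7f73cc6f",
]
# Node device names present on a hypothetical migration target.
target_host = {"pci_0000_03_00_0", "mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f"}
print(first_available_parent(parents, target_host))
```

On this made-up target the primary is missing, so the second entry in
the list is the one that gets picked.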

Using a gpu_pool type XML you could ship that around rather than trying
to somehow ship across nodedev XML to define something on the migration
target.

John

Maybe I'm lost in the weeds somewhere too ;-)

> 
> So, this was the first idea off the top of my head, so I'd appreciate any
> suggestions, comments, especially from people who have got the 'legacy'
> insight into libvirt and can predict potential pitfalls based on experience :).
> 
> Thanks,
> Erik
> 
> --
> libvir-list mailing list
> libvir-list@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/libvir-list
> 
