Re: [RFC PATCH 3/7] vfio: add spimdev support

Kenneth Lee <liguozhu@xxxxxxxxxxxxx> · Mon, 6 Aug 2018 09:40:04 +0800

On Thu, Aug 02, 2018 at 12:43:27PM -0600, Alex Williamson wrote:
> 
> On Thu, 2 Aug 2018 10:35:28 +0200
> Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> 
> > On Thu, 2 Aug 2018 15:34:40 +0800
> > Kenneth Lee <liguozhu@xxxxxxxxxxxxx> wrote:
> > 
> > > On Thu, Aug 02, 2018 at 04:24:22AM +0000, Tian, Kevin wrote:  
> > 
> > > > > From: Kenneth Lee [mailto:liguozhu@xxxxxxxxxxxxx]
> > > > > Sent: Thursday, August 2, 2018 11:47 AM
> > > > >     
> > > > > >    
> > > > > > > From: Kenneth Lee
> > > > > > > Sent: Wednesday, August 1, 2018 6:22 PM
> > > > > > >
> > > > > > > From: Kenneth Lee <liguozhu@xxxxxxxxxxxxx>
> > > > > > >
> > > > > > > > SPIMDEV is "Share Parent IOMMU Mdev". It is a vfio-mdev, but it differs
> > > > > > > > from the general vfio-mdev:
> > > > > > >
> > > > > > > 1. It shares its parent's IOMMU.
> > > > > > > > 2. There is no hardware resource attached when the mdev is created. The
> > > > > > > > hardware resource (a `queue') is allocated only when the mdev is
> > > > > > > > opened.
> > > > > >
> > > > > > Alex has concern on doing so, as pointed out in:
> > > > > >
> > > > > > 	https://www.spinics.net/lists/kvm/msg172652.html
> > > > > >
> > > > > > resource allocation should be reserved at creation time.    
> > > > > 
> > > > > > Yes. That is why I keep saying that SPIMDEV is not for "VM", it is for "many
> > > > > > processes", it is just an access point to the process. Not a device to VM. I
> > > > > > hope Alex can accept it:)
> > > > >     
> > > > 
> > > > VFIO is just about assigning device resource to user space. It doesn't care
> > > > whether it's native processes or VM using the device so far. Along the direction
> > > > which you described, looks VFIO needs to support the configuration that
> > > > some mdevs are used for native process only, while others can be used
> > > > for both native and VM. I'm not sure whether there is a clean way to
> > > > enforce it...    
> > > 
> > > I had the same idea at the beginning. But finally I found that the life cycle
> > > of the virtual device for VM and process were different. Consider you create
> > > some mdevs for VM use, you will give all those mdevs to lib-virt, which
> > > > distributes those mdevs to VMs or containers. If the VM or container exits, the
> > > > mdev is returned to the lib-virt and used for next allocation. It is the
> > > > administrator who controls every mdev's allocation.
> 
> Libvirt currently does no management of mdev devices, so I believe
> this example is fictitious.  The extent of libvirt's interaction with
> mdev is that XML may specify an mdev UUID as the source for a hostdev
> and set the permissions on the device files appropriately.  Whether
> mdevs are created in advance and re-used or created and destroyed
> around a VM instance (for example via qemu hooks scripts) is not a
> policy that libvirt imposes.
>  
> > > > But for process, it is different. There is no lib-virt in control. The
> > > > administrator's intention is to grant some type of application access to the
> > > > hardware. The application can get a handle of the hardware, send requests and get
> > > > the result. That's all. He/She does not care which mdev is allocated to that
> > > > application. If it crashes, it should be the kernel's responsibility to withdraw
> > > > the resource, the system administrator does not want to do it by hand.
> 
> Libvirt is also not a required component for VM lifecycles, it's an
> optional management interface, but there are also VM lifecycles exactly
> as you describe.  A VM may want a given type of vGPU, there might be
> multiple sources of that type and any instance is fungible to any
> other.  Such an mdev can be dynamically created, assigned to the VM,
> and destroyed later.  Why do we need to support "empty" mdevs that do
> not reserve resources until opened?  The concept of available
> instances is entirely lost with that approach and it creates an
> environment that's difficult to support, resources may not be available
> at the time the user attempts to access them.
>  
> > I don't think that you should distinguish the cases by the presence of
> > a management application. How can the mdev driver know what the
> > intention behind using the device is?
> 
> Absolutely, vfio is a userspace driver interface, it's not tailored to
> VM usage and we cannot know the intentions of the user.
>  
> > Would it make more sense to use a different mechanism to enforce that
> > applications only use those handles they are supposed to use? Maybe
> > cgroups? I don't think it's a good idea to push usage policy into the
> > kernel.
> 
> I agree, this sounds like a userspace problem, mdev supports dynamic
> creation and removal of mdev devices, if there's an issue with
> maintaining a set of standby devices that a user has access to, this
> sounds like a userspace broker problem.  It makes more sense to me to
> have a model where a userspace application can make a request to a
> broker and the broker can reply with "none available" rather than
> having a set of devices on standby that may or may not work depending
> on the system load and other users.  Thanks,
> 
> Alex

I am sorry, I used the wrong mutt command when replying to Cornelia's last mail,
so that reply did not stay within this thread. Please let me repeat my point
here.

I should not have used libvirt as the example. But WarpDrive works in the
following scenario:

1. It supports thousands of processes. Take a zip accelerator as an example: any
application that needs data compression/decompression will need to interact with
the accelerator. To support that, you would have to create tens of thousands of
mdevs for their use. I don't think it is a good idea to have so many devices in
the system.

2. The application does not want to own the mdev for long. It just needs an
access point for the hardware service. If it has to interact with a management
agent for allocation and release, this makes the problem complex.

3. The service is bound to the process. When the process exits, the resource
should be released automatically. The kernel is the best place to monitor the
state of the process.

I agree this extends the concept of mdev. But again, it is cleaner than
creating another facility for userland DMA. We just need to treat the mdev as an
access point of the device: when it is opened, the resource is given. It is not a
device for a particular entity or instance. But it is still a device which can
provide the service of the hardware.
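
To make the life cycle I mean more concrete, here is a rough user-space toy
model, not driver code. The names (QueuePool, open, release, process_exit) are
made up for illustration and are not from the patch. It only assumes a device
with a fixed number of queues: a queue is handed out at open time, and every
queue held by a process is reclaimed when that process goes away.

```python
class QueuePool:
    """Toy model of a device with a fixed number of hardware queues.

    Hypothetical illustration only; the names are not from the patch set.
    """

    def __init__(self, num_queues):
        self.free = list(range(num_queues))  # queue ids not in use
        self.owner = {}                      # queue id -> owning pid

    def open(self, pid):
        """Model of open(): the resource is given only at this point."""
        if not self.free:
            # A real driver would return an error code to the caller here.
            raise OSError("no queue available")
        qid = self.free.pop()
        self.owner[qid] = pid
        return qid

    def release(self, qid):
        """Model of close(): the queue goes back to the pool."""
        del self.owner[qid]
        self.free.append(qid)

    def process_exit(self, pid):
        """Model of kernel cleanup: reclaim every queue the process held."""
        for qid in [q for q, p in self.owner.items() if p == pid]:
            self.release(qid)
```

In this model nobody has to create or destroy a device per process: a crashed
process simply goes through process_exit() and its queue becomes available to
the next open().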

Cornelia is worried about resource starvation. I think that can be solved by
setting restrictions on the mdev itself. An mdev management agent does not help
much here: management of the mdevs themselves can still end up running out of
resources.
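
One possible reading of "restriction on the mdev itself" is a per-process cap
enforced at open time. The sketch below is again only a user-space toy model
under that assumption. The class name and the per_process_limit parameter are
hypothetical, and the patch does not define such a knob.

```python
from collections import Counter


class LimitedMdev:
    """Toy model: an access point that caps the queues one process may hold.

    Hypothetical illustration; the limit is an assumed policy, not an
    existing mdev attribute.
    """

    def __init__(self, total_queues, per_process_limit):
        self.available = total_queues
        self.per_process_limit = per_process_limit
        self.held = Counter()  # pid -> number of queues currently held

    def open(self, pid):
        if self.held[pid] >= self.per_process_limit:
            raise PermissionError("per-process queue limit reached")
        if self.available == 0:
            raise OSError("no queue available")
        self.available -= 1
        self.held[pid] += 1

    def close(self, pid):
        if self.held[pid] == 0:
            raise ValueError("process holds no queue")
        self.held[pid] -= 1
        self.available += 1
```

With such a cap a single process cannot starve the others, although the device
as a whole can of course still run out of queues.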

Thanks

-- 
			-Kenneth(Hisilicon)
