On Mon, Aug 13, 2018 at 03:23:01PM -0400, Jerome Glisse wrote:
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
>
> On Mon, Aug 13, 2018 at 05:29:31PM +0800, Kenneth Lee wrote:
> >
> > I made a quick change basing on the RFCv1 here:
> >
> > https://github.com/Kenneth-Lee/linux-kernel-warpdrive/commits/warpdrive-v0.6
> >
> > I just made it compilable and not test it yet.
> > But it shows how the idea is going to be.
> >
> > The Pros is: most of the virtual device stuff can be removed. Resource
> > management is on the opened files only.
> >
> > The Cons is: as Jean said, we have to redo something that has been done by
> > VFIO. These mainly are:
> >
> > 1. Track the dma operation and remove them on resource releasing
> > 2. Pin the memory with gup and do accounting
> >
> > It is not going to be easy to make a decision...
> >
>
> Maybe it would be good to list things you want to do. Looking at your tree
> it seems you are re-inventing what dma-buf is already doing.

My English is quite limited ;). I think I did not explain it well in the
WarpDrive document. Please let me try again here:

The requirement of WarpDrive is simple. Many acceleration requirements come
from user space, such as OpenSSL, AI, and so on. We want to provide a framework
that lets a userland application summon the accelerator easily.

So the scenario is simple: the application is doing its job and faces a tough
and boring task, say compression. It drops the data pointer into the command
queue and lets the accelerator do the work at its bidding. The application then
either continues its own job until the task is done, or waits for the task
synchronously.

My understanding is that dma-buf is driver oriented. The buffer is created by
one driver and exported as an fd to user space, and user space can then share
it with some attached devices.

But in our scenario the whole buffer comes from user space. The application
assigns the pointer directly to the command queue, and the hardware uses this
address directly. The kernel should map this particular range of virtual
memory, or the whole process address space, in the IOMMU under the process's
PASID. The hardware can then use the process's addresses, which may be not only
the address placed in the command queue but also an address stored inside the
memory itself.

To make this work, we have to pin the memory or install a page fault handler
when the memory is shared with the hardware. And...
the big work of pinning memory is not gup (get_user_pages()), but rlimit
accounting :)

>
> So here is what I understand for SVM/SVA:
> (1) allow userspace to create a command buffer for a device and bind
>     it to its address space (PASID)
> (2) allow userspace to directly schedule commands on its command buffer
>
> No need to do tracking here as SVM/SVA which rely on PASID and something
> like PCIE ATS (address translation service). Userspace can shoot itself
> in the foot but nothing harmful can happen.
>

Yes, we can release the whole page table based on the PASID (this also needs
Jean to provide the interface ;)). But the gup part still needs to be tracked.
(This is what VFIO does.)

>
> Non SVM/SVA:
> (3) allow userspace to wrap a region of its memory into an object so
>     that it can be DMA mapped (ie GUP + dma_map_page())
> (4) Have userspace schedule command referencing object created in (3)
>     using an ioctl.

Yes, this is going to be something like the NOIOMMU mode in VFIO. The hardware
has to accept DMA/physical addresses. But anyway, this is not the major
intention of WarpDrive.

>
> We need to keep track of object usage by the hardware so that we know
> when it is safe to release resources (dma_unmap_page()). The dma-buf
> provides everything you want for (3) and (4). With dma-buf you create
> object and each time it is used by a device you associate a fence with
> it. When fence is signaled it means that the hardware is done using
> that object. Fence also allows proper synchronization between multiple
> devices. For instance making sure that the second device waits for the
> first device before starting doing its thing. dma-buf documentation is
> much more thorough explaining all this.
>

Your idea suggests to me that the dma-buf design is based on sharing memory
from the driver, not from user space. That is why the signal is tied to the
buffer itself: the buffer can be used again and again. This is good for
performance. The cost is high if the data has to be remapped every time.
But if we consider that the whole application address space can be devoted to
the device with SVM/SVA support, this can become acceptable.

>
> Now from implementation point of view, maybe it would be a good idea
> to create something like the virtual gem driver. It is a virtual device
> that allows to create GEM object. So maybe we want a virtual device that
> allows to create dma-buf object from process memory and allows sharing of
> those dma-buf between multiple devices.
>
> Userspace would only have to talk to this virtual device to create
> object and wrap its memory around, then it could use this object against
> many actual devices.
>
>
> This decouples the memory management, that can be shared between all
> devices, from the actual device driver, which is obviously specific to
> every single device.
>

Forcing the application to use device-allocated memory is an alluring choice;
it makes things simple. Let us consider it for a while...

>
> Note that dma-buf uses file so that once all file references are gone the
> resource can be freed and cleanup can happen (dma_unmap_page() ...). This
> properly handles the resource lifetime issue you seem to be worried about.
>
> Cheers,
> Jérôme

-- 
			-Kenneth(Hisilicon)
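
The remark in this mail that the big work of pinning is the rlimit accounting
rather than the gup call can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not kernel code and not the
WarpDrive or VFIO interface: the names AddressSpace, Queue, share, unshare and
release are hypothetical, a 4 KiB page size is assumed, and overlapping ranges
are not merged (they would be charged twice).

```python
"""Toy user-space model of pinned-page accounting (not kernel code)."""
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for this model


@dataclass
class AddressSpace:
    """Stands in for a process address space; counts pinned pages."""
    memlock_limit_bytes: int   # plays the role of RLIMIT_MEMLOCK
    locked_pages: int = 0      # pages currently charged to the process


@dataclass
class Queue:
    """Stands in for one opened accelerator queue (one open file)."""
    mm: AddressSpace
    pinned: dict = field(default_factory=dict)  # start address -> page count

    def share(self, addr: int, length: int) -> bool:
        """Charge [addr, addr + length) against the limit, then record the pin."""
        if length <= 0 or addr in self.pinned:
            return False
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        npages = last - first + 1  # an unaligned range touches an extra page
        limit_pages = self.mm.memlock_limit_bytes // PAGE_SIZE
        if self.mm.locked_pages + npages > limit_pages:
            return False           # over the limit: refuse and pin nothing
        self.mm.locked_pages += npages
        self.pinned[addr] = npages
        return True

    def unshare(self, addr: int) -> None:
        """Drop one pin and return its pages to the budget."""
        npages = self.pinned.pop(addr)  # raises KeyError if never shared
        self.mm.locked_pages -= npages

    def release(self) -> None:
        """Model of closing the file: undo every outstanding pin."""
        for addr in list(self.pinned):
            self.unshare(addr)
```

The model ties the bookkeeping to the opened queue, so closing the file is
enough to return every charged page. That is the part that has to be redone
if the VFIO infrastructure is not reused.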