On Tue, Feb 11, 2020 at 03:54:21PM +0800, Zhangfei Gao wrote: > Uacce (Unified/User-space-access-intended Accelerator Framework) targets to > provide Shared Virtual Addressing (SVA) between accelerators and processes. > So accelerator can access any data structure of the main cpu. > This differs from the data sharing between cpu and io device, which share > data content rather than address. > Because of unified address, hardware and user space of process can share > the same virtual address in the communication. > > Uacce is intended to be used with Jean Philippe Brucker's SVA > patchset[1], which enables IO side page fault and PASID support. > We have keep verifying with Jean's sva patchset [2] > We also keep verifying with Eric's SMMUv3 Nested Stage patches [3] > > This series and related zip & qm driver > https://github.com/Linaro/linux-kernel-warpdrive/tree/v5.6-rc1-uacce-v13 > > The library and user application: > https://github.com/Linaro/warpdrive/tree/wdprd-upstream-v13 > > References: > [1] http://jpbrucker.net/sva/ > [2] http://jpbrucker.net/git/linux/log/?h=sva/zip-devel > [3] https://github.com/eauger/linux/tree/v5.3.0-rc0-2stage-v9 > > The series contains 4 patches, > Patch 1 & 2 are for uacce > Patch 3 & 4 are an example using uacce, which happens to be crypto, can be merged later. > > Change History: > v13: > Rebase to v5.6-rc1 > add Reviewd-by Greg > fix minor issue in zip_main.c, check whether uacce is null then register > > v12: > Suggested by Greg, > Remove module_get and module_put in uacce, which blocks rmmod parent module when > application are running, while application should not forbid a module from being unloaded > > v11: > add Reviewed-by, and fix one mismatch with sys > > v10: > Modify the include header to fix kbuild test erorr in other arch. > > v9: > Suggested by Jonathan > 1. Remove sysfs: numa_distance, node_id, id, also add is_visible callback > 2. Split the api to solve the potential race > struct uacce_device *uacce_alloc(struct device *parent, > struct uacce_interface *interface) > int uacce_register(struct uacce_device *uacce) > void uacce_remove(struct uacce_device *uacce) > 3. Split clean up patch 03 > > v8: > Address some comments from Jonathan > Merge Jean's patch, using uacce_mm instead of pid for sva_exit > > v7: > As suggested by Jean and Jerome > Only consider sva case and remove unused dma apis for the first patch. > Also add mm_exit for sva and vm_ops.close etc > > > v6: https://lkml.org/lkml/2019/10/16/231 > Change sys qfrs_size to different file, suggested by Jonathan > Fix crypto daily build issue and based on crypto code base, also 5.4-rc1. > > v5: https://lkml.org/lkml/2019/10/14/74 > Add an example patch using the uacce interface, suggested by Greg > 0003-crypto-hisilicon-register-zip-engine-to-uacce.patch > > v4: https://lkml.org/lkml/2019/9/17/116 > Based on 5.4-rc1 > Considering other driver integrating uacce, > if uacce not compiled, uacce_register return error and uacce_unregister is empty. > Simplify uacce flag: UACCE_DEV_SVA. > Address Greg's comments: > Fix state machine, remove potential syslog triggered from user space etc. > > v3: https://lkml.org/lkml/2019/9/2/990 > Recommended by Greg, use sturct uacce_device instead of struct uacce, > and use struct *cdev in struct uacce_device, as a result, > cdev can be released by itself when refcount decreased to 0. > So the two structures are decoupled and self-maintained by themsleves. > Also add dev.release for put_device. > > v2: https://lkml.org/lkml/2019/8/28/565 > Address comments from Greg and Jonathan > Modify interface uacce_register > Drop noiommu mode first > > v1: https://lkml.org/lkml/2019/8/14/277 > 1. Rebase to 5.3-rc1 > 2. Build on iommu interface > 3. Verifying with Jean's sva and Eric's nested mode iommu. > 4. User library has developed a lot: support zlib, openssl etc. > 5. Move to misc first > > RFC3: > https://lkml.org/lkml/2018/11/12/1951 > > RFC2: > https://lwn.net/Articles/763990/ > > > Background of why Uacce: > Von Neumann processor is not good at general data manipulation. > It is designed for control-bound rather than data-bound application. > The latter need less control path facility and more/specific ALUs. > So there are more and more heterogeneous processors, such as > encryption/decryption accelerators, TPUs, or > EDGE (Explicated Data Graph Execution) processors, introduced to gain > better performance or power efficiency for particular applications > these days. > > There are generally two ways to make use of these heterogeneous processors: > > The first is to make them co-processors, just like FPU. > This is good for some application but it has its own cons: > It changes the ISA set permanently. > You must save all state elements when the process is switched out. > But most data-bound processors have a huge set of state elements. > It makes the kernel scheduler more complex. > > The second is Accelerator. > It is taken as a IO device from the CPU's point of view > (but it need not to be physically). The process, running on CPU, > hold a context of the accelerator and send instructions to it as if > it calls a function or thread running with FPU. > The context is bound with the processor itself. > So the state elements remain in the hardware context until > the context is released. > > We believe this is the core feature of an "Accelerator" vs. Co-processor > or other heterogeneous processors. > > The intention of Uacce is to provide the basic facility to backup > this scenario. Its first step is to make sure the accelerator and process > can share the same address space. So the accelerator ISA can directly > address any data structure of the main CPU. > This differs from the data sharing between CPU and IO device, > which share data content rather than address. > So it is different comparing to the other DMA libraries. > > In the future, we may add more facility to support linking accelerator > library to the main application, or managing the accelerator context as > special thread. > But no matter how, this can be a solid start point for new processor > to be used as an "accelerator" as this is the essential requirement. > > > The Fork Scenario > ================= > For a process with allocated queues and shared memory, what happen if it forks > a child? > > The fd of the queue is duplicated on fork, but requests sent from the child > process are blocked. > > It is recommended to add O_CLOEXEC to the queue file. > > The queue mmap space has a VM_DONTCOPY in its VMA. So the child will lose all > those VMAs. > > This is a reason why Uacce does not adopt the mode used in VFIO and > InfiniBand. Both solutions can set any user pointer for hardware sharing. > But they cannot support fork when the dma is in process. Or the > "Copy-On-Write" procedure will make the parent process lost its physical > pages. > > > Difference to the VFIO and IB framework > --------------------------------------- > The essential function of Uacce is to let the device access the user > address directly. There are many device drivers doing the same in the kernel. > And both VFIO and IB can provide similar functions in framework level. > > But Uacce has a different goal: "share address space". It is > not taken the request to the accelerator as an enclosure data structure. It > takes the accelerator as another thread of the same process. So the > accelerator can refer to any address used by the process. > > Both VFIO and IB are taken this as "memory sharing", not "address sharing". > They care more on sharing the block of memory. But if there is an address > stored in the block and referring to another memory region. The address may > not be valid. > > By adding more constraints to the VFIO and IB framework, in some sense, we may > achieve a similar goal. But we gave it up finally. Both VFIO and IB have extra > assumption which is unnecessary to Uacce. They may hurt each other if we > try to merge them together. > > VFIO manages resource of a hardware as a "virtual device". If a device need to > serve a separated application. It must isolate the resource as a separate > virtual device. And the life cycle of the application and virtual device are > unnecessary unrelated. And most concepts, such as bus, driver, probe and > so on, to make it as a "device" is unnecessary either. And the logic added to > VFIO to make address sharing do no help on "creating a virtual device". > > IB creates a "verbs" standard for sharing memory region to another remote > entity. Most of these verbs are to make memory region between entities to be > synchronized. This is not what accelerator need. Accelerator is in the same > memory system with the CPU. It refers to the same memory system among CPU and > devices. So the local memory terms/verbs are good enough for it. Extra "verbs" > are not necessary. And its queue (like queue pair in IB) is the communication > channel direct to the accelerator hardware. There is nothing about memory > itself. > > Further, both VFIO and IB use the "pin" (get_user_page) way to lock local > memory in place. This is flexible. But it can cause other problems. For > example, if the user process fork a child process. The COW procedure may make > the parent process lost its pages which are sharing with the device. These may > be fixed in the future. But is not going to be easy. (There is a discussion > about this on Linux Plumbers Conference 2018 [1]) > > So we choose to build the solution directly on top of IOMMU interface. IOMMU > is the essential way for device and process to share their page mapping from > the hardware perspective. It will be safe to create a software solution on > this assumption. Uacce manages the IOMMU interface for the accelerator > device, so the device driver can export some of the resources to the user > space. Uacce than can make sure the device and the process have the same > address space. > > > References > ========== > .. [1] https://lwn.net/Articles/774411/ > > Kenneth Lee (2): > uacce: Add documents for uacce > uacce: add uacce driver > > Zhangfei Gao (2): > crypto: hisilicon - Remove module_param uacce_mode > crypto: hisilicon - register zip engine to uacce > > Documentation/ABI/testing/sysfs-driver-uacce | 39 ++ > Documentation/misc-devices/uacce.rst | 176 ++++++ > drivers/crypto/hisilicon/qm.c | 239 ++++++- > drivers/crypto/hisilicon/qm.h | 11 + > drivers/crypto/hisilicon/zip/zip_main.c | 49 +- > drivers/misc/Kconfig | 1 + > drivers/misc/Makefile | 1 + > drivers/misc/uacce/Kconfig | 13 + > drivers/misc/uacce/Makefile | 2 + > drivers/misc/uacce/uacce.c | 617 +++++++++++++++++++ > include/linux/uacce.h | 161 +++++ > include/uapi/misc/uacce/hisi_qm.h | 23 + > include/uapi/misc/uacce/uacce.h | 38 ++ > 13 files changed, 1337 insertions(+), 33 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-driver-uacce > create mode 100644 Documentation/misc-devices/uacce.rst > create mode 100644 drivers/misc/uacce/Kconfig > create mode 100644 drivers/misc/uacce/Makefile > create mode 100644 drivers/misc/uacce/uacce.c > create mode 100644 include/linux/uacce.h > create mode 100644 include/uapi/misc/uacce/hisi_qm.h > create mode 100644 include/uapi/misc/uacce/uacce.h All applied. Thanks. -- Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt