Re: [PATCH v13 0/4] Add uacce module for Accelerator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Tue, Feb 11, 2020 at 03:54:21PM +0800, Zhangfei Gao wrote:
> Uacce (Unified/User-space-access-intended Accelerator Framework) targets to
> provide Shared Virtual Addressing (SVA) between accelerators and processes.
> So accelerator can access any data structure of the main cpu.
> This differs from the data sharing between cpu and io device, which share
> data content rather than address.
> Because of unified address, hardware and user space of process can share
> the same virtual address in the communication.
> Uacce is intended to be used with Jean Philippe Brucker's SVA
> patchset[1], which enables IO side page fault and PASID support. 
> We have keep verifying with Jean's sva patchset [2]
> We also keep verifying with Eric's SMMUv3 Nested Stage patches [3]
> This series and related zip & qm driver
> The library and user application:
> References:
> [1]
> [2]
> [3]
> The series contains 4 patches,
> Patch 1 & 2 are for uacce
> Patch 3 & 4 are an example using uacce, which happens to be crypto, can be merged later.
> Change History:
> v13:
> Rebase to v5.6-rc1
> add Reviewd-by Greg
> fix minor issue in zip_main.c, check whether uacce is null then register
> v12:
> Suggested by Greg,
> Remove module_get and module_put in uacce, which blocks rmmod parent module when
> application are running, while application should not forbid a module from being unloaded
> v11:
> add Reviewed-by, and fix one mismatch with sys
> v10:
> Modify the include header to fix kbuild test erorr in other arch.
> v9:
> Suggested by Jonathan
> 1. Remove sysfs: numa_distance, node_id, id, also add is_visible callback
> 2. Split the api to solve the potential race
> struct uacce_device *uacce_alloc(struct device *parent,
> 				 struct uacce_interface *interface)
> int uacce_register(struct uacce_device *uacce)
> void uacce_remove(struct uacce_device *uacce)
> 3. Split clean up patch 03
> v8:
> Address some comments from Jonathan
> Merge Jean's patch, using uacce_mm instead of pid for sva_exit
> v7:
> As suggested by Jean and Jerome
> Only consider sva case and remove unused dma apis for the first patch.
> Also add mm_exit for sva and vm_ops.close etc
> v6:
> Change sys qfrs_size to different file, suggested by Jonathan
> Fix crypto daily build issue and based on crypto code base, also 5.4-rc1.
> v5:
> Add an example patch using the uacce interface, suggested by Greg
> 0003-crypto-hisilicon-register-zip-engine-to-uacce.patch
> v4:
> Based on 5.4-rc1
> Considering other driver integrating uacce, 
> if uacce not compiled, uacce_register return error and uacce_unregister is empty.
> Simplify uacce flag: UACCE_DEV_SVA.
> Address Greg's comments: 
> Fix state machine, remove potential syslog triggered from user space etc.
> v3:
> Recommended by Greg, use sturct uacce_device instead of struct uacce,
> and use struct *cdev in struct uacce_device, as a result, 
> cdev can be released by itself when refcount decreased to 0.
> So the two structures are decoupled and self-maintained by themsleves.
> Also add dev.release for put_device.
> v2:
> Address comments from Greg and Jonathan
> Modify interface uacce_register
> Drop noiommu mode first
> v1:
> 1. Rebase to 5.3-rc1
> 2. Build on iommu interface
> 3. Verifying with Jean's sva and Eric's nested mode iommu.
> 4. User library has developed a lot: support zlib, openssl etc.
> 5. Move to misc first
> RFC3:
> RFC2:
> Background of why Uacce:
> Von Neumann processor is not good at general data manipulation.
> It is designed for control-bound rather than data-bound application.
> The latter need less control path facility and more/specific ALUs.
> So there are more and more heterogeneous processors, such as
> encryption/decryption accelerators, TPUs, or
> EDGE (Explicated Data Graph Execution) processors, introduced to gain
> better performance or power efficiency for particular applications
> these days.
> There are generally two ways to make use of these heterogeneous processors:
> The first is to make them co-processors, just like FPU.
> This is good for some application but it has its own cons:
> It changes the ISA set permanently.
> You must save all state elements when the process is switched out.
> But most data-bound processors have a huge set of state elements.
> It makes the kernel scheduler more complex.
> The second is Accelerator.
> It is taken as a IO device from the CPU's point of view
> (but it need not to be physically). The process, running on CPU,
> hold a context of the accelerator and send instructions to it as if
> it calls a function or thread running with FPU.
> The context is bound with the processor itself.
> So the state elements remain in the hardware context until
> the context is released.
> We believe this is the core feature of an "Accelerator" vs. Co-processor
> or other heterogeneous processors.
> The intention of Uacce is to provide the basic facility to backup
> this scenario. Its first step is to make sure the accelerator and process
> can share the same address space. So the accelerator ISA can directly
> address any data structure of the main CPU.
> This differs from the data sharing between CPU and IO device,
> which share data content rather than address.
> So it is different comparing to the other DMA libraries.
> In the future, we may add more facility to support linking accelerator
> library to the main application, or managing the accelerator context as
> special thread.
> But no matter how, this can be a solid start point for new processor
> to be used as an "accelerator" as this is the essential requirement.
> The Fork Scenario
> =================
> For a process with allocated queues and shared memory, what happen if it forks
> a child?
> The fd of the queue is duplicated on fork, but requests sent from the child
> process are blocked.
> It is recommended to add O_CLOEXEC to the queue file.
> The queue mmap space has a VM_DONTCOPY in its VMA. So the child will lose all
> those VMAs.
> This is a reason why Uacce does not adopt the mode used in VFIO and
> InfiniBand.  Both solutions can set any user pointer for hardware sharing.
> But they cannot support fork when the dma is in process. Or the
> "Copy-On-Write" procedure will make the parent process lost its physical
> pages.
> Difference to the VFIO and IB framework
> ---------------------------------------
> The essential function of Uacce is to let the device access the user
> address directly. There are many device drivers doing the same in the kernel.
> And both VFIO and IB can provide similar functions in framework level.
> But Uacce has a different goal: "share address space". It is
> not taken the request to the accelerator as an enclosure data structure. It
> takes the accelerator as another thread of the same process. So the
> accelerator can refer to any address used by the process.
> Both VFIO and IB are taken this as "memory sharing", not "address sharing".
> They care more on sharing the block of memory. But if there is an address
> stored in the block and referring to another memory region. The address may
> not be valid.
> By adding more constraints to the VFIO and IB framework, in some sense, we may
> achieve a similar goal. But we gave it up finally. Both VFIO and IB have extra
> assumption which is unnecessary to Uacce. They may hurt each other if we
> try to merge them together.
> VFIO manages resource of a hardware as a "virtual device". If a device need to
> serve a separated application. It must isolate the resource as a separate
> virtual device.  And the life cycle of the application and virtual device are
> unnecessary unrelated. And most concepts, such as bus, driver, probe and
> so on, to make it as a "device" is unnecessary either. And the logic added to
> VFIO to make address sharing do no help on "creating a virtual device".
> IB creates a "verbs" standard for sharing memory region to another remote
> entity.  Most of these verbs are to make memory region between entities to be
> synchronized.  This is not what accelerator need. Accelerator is in the same
> memory system with the CPU. It refers to the same memory system among CPU and
> devices. So the local memory terms/verbs are good enough for it. Extra "verbs"
> are not necessary. And its queue (like queue pair in IB) is the communication
> channel direct to the accelerator hardware. There is nothing about memory
> itself.
> Further, both VFIO and IB use the "pin" (get_user_page) way to lock local
> memory in place.  This is flexible. But it can cause other problems. For
> example, if the user process fork a child process. The COW procedure may make
> the parent process lost its pages which are sharing with the device. These may
> be fixed in the future. But is not going to be easy. (There is a discussion
> about this on Linux Plumbers Conference 2018 [1])
> So we choose to build the solution directly on top of IOMMU interface. IOMMU
> is the essential way for device and process to share their page mapping from
> the hardware perspective. It will be safe to create a software solution on
> this assumption.  Uacce manages the IOMMU interface for the accelerator
> device, so the device driver can export some of the resources to the user
> space. Uacce than can make sure the device and the process have the same
> address space.
> References
> ==========
> .. [1]
> Kenneth Lee (2):
>   uacce: Add documents for uacce
>   uacce: add uacce driver
> Zhangfei Gao (2):
>   crypto: hisilicon - Remove module_param uacce_mode
>   crypto: hisilicon - register zip engine to uacce
>  Documentation/ABI/testing/sysfs-driver-uacce |  39 ++
>  Documentation/misc-devices/uacce.rst         | 176 ++++++
>  drivers/crypto/hisilicon/qm.c                | 239 ++++++-
>  drivers/crypto/hisilicon/qm.h                |  11 +
>  drivers/crypto/hisilicon/zip/zip_main.c      |  49 +-
>  drivers/misc/Kconfig                         |   1 +
>  drivers/misc/Makefile                        |   1 +
>  drivers/misc/uacce/Kconfig                   |  13 +
>  drivers/misc/uacce/Makefile                  |   2 +
>  drivers/misc/uacce/uacce.c                   | 617 +++++++++++++++++++
>  include/linux/uacce.h                        | 161 +++++
>  include/uapi/misc/uacce/hisi_qm.h            |  23 +
>  include/uapi/misc/uacce/uacce.h              |  38 ++
>  13 files changed, 1337 insertions(+), 33 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-driver-uacce
>  create mode 100644 Documentation/misc-devices/uacce.rst
>  create mode 100644 drivers/misc/uacce/Kconfig
>  create mode 100644 drivers/misc/uacce/Makefile
>  create mode 100644 drivers/misc/uacce/uacce.c
>  create mode 100644 include/linux/uacce.h
>  create mode 100644 include/uapi/misc/uacce/hisi_qm.h
>  create mode 100644 include/uapi/misc/uacce/uacce.h

All applied.  Thanks.
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page:
PGP Key:

[Index of Archives]     [LM Sensors]     [Linux Sound]     [ALSA Users]     [ALSA Devel]     [Linux Audio Users]     [Linux Media]     [Kernel]     [Gimp]     [Yosemite News]     [Linux Media]

  Powered by Linux