From: David Francis <David.Francis@xxxxxxx>

Add six long comments outlining the basic features of the driver,
to aid new developers.

Signed-off-by: David Francis <David.Francis@xxxxxxx>
Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 74 +++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   | 25 ++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c    | 57 +++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 33 ++++++++++
 4 files changed, 189 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 6abfe10229a2..ea25a47b62dc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1031,6 +1031,80 @@ static int kfd_ioctl_get_available_memory(struct file *filep,
 	return 0;
 }
 
+/**
+ * DOC: Memory_Types
+ *
+ * There are many different types of memory that KFD can manage, each with
+ * slightly different interfaces.
+ *
+ * VRAM and GTT
+ * ------------
+ *
+ * VRAM and GTT can be allocated with the AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl.
+ * This ioctl returns a handle used to refer to the memory in future KFD
+ * ioctls, as well as an mmap_offset used for mapping the allocation on the
+ * CPU. VRAM memory is located on the GPU, while GTT memory is located in host
+ * memory. Once memory is allocated, it must be mapped with the
+ * AMDKFD_IOC_MAP_MEMORY_TO_GPU ioctl before the GPU can access it.
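+ *
+ * As an illustrative sketch only (an editor's example, not authoritative
+ * uAPI documentation): allocating VRAM and mapping it to one GPU from
+ * userspace roughly follows the pattern below. It assumes the argument
+ * structs and flags declared in include/uapi/linux/kfd_ioctl.h, plus
+ * hypothetical variables kfd_fd (an open /dev/kfd file descriptor), gpu_id
+ * (taken from the topology), va (a caller-chosen GPU virtual address) and
+ * size::
+ *
+ *	struct kfd_ioctl_alloc_memory_of_gpu_args alloc_args = {
+ *		.va_addr = va,
+ *		.size    = size,
+ *		.gpu_id  = gpu_id,
+ *		.flags   = KFD_IOC_ALLOC_MEM_FLAGS_VRAM,
+ *	};
+ *
+ *	// On success the kernel fills in alloc_args.handle (for later ioctls)
+ *	// and alloc_args.mmap_offset (for mapping on the CPU side).
+ *	if (ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &alloc_args) == -1)
+ *		return -1;
+ *
+ *	// Map the allocation into the GPU's page tables before the GPU uses it.
+ *	__u32 devices[] = { gpu_id };
+ *	struct kfd_ioctl_map_memory_to_gpu_args map_args = {
+ *		.handle               = alloc_args.handle,
+ *		.device_ids_array_ptr = (__u64)(uintptr_t)devices,
+ *		.n_devices            = 1,
+ *	};
+ *	if (ioctl(kfd_fd, AMDKFD_IOC_MAP_MEMORY_TO_GPU, &map_args) == -1)
+ *		return -1;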
+ *
+ * Doorbell and MMIO
+ * -----------------
+ *
+ * Each process is assigned two pages of doorbell memory used to signal that
+ * its usermode queues have packets waiting. AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
+ * associates these pages with a virtual address. They must still be mapped if
+ * the GPU is to access them.
+ *
+ * There is one page of MMIO memory per GPU that is accessible to userspace by
+ * the same means.
+ *
+ * userptr
+ * -------
+ *
+ * userptr memory is user-allocated system memory, allocated with malloc or
+ * similar. As with doorbell and MMIO memory, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
+ * does not allocate the memory; instead it registers existing memory for
+ * mapping.
+ *
+ * SVM
+ * ---
+ *
+ * SVM is a different memory-allocation API available on GFX9+. Like userptr
+ * memory, SVM maps existing user-managed memory onto the GPU.
+ *
+ * XNACK is an SVM feature that is disabled by default because it has a
+ * performance cost. When XNACK is enabled, SVM memory can take recoverable
+ * page faults, allowing KFD to allocate memory without reserving physical
+ * address space, performing the physical allocation only on page fault. With
+ * XNACK, SVM uses the Heterogeneous Memory Manager (HMM) to migrate pages back
+ * and forth between the device and the host in response to memory pressure
+ * and page faults.
+ *
+ * Scratch
+ * -------
+ *
+ * Scratch memory is VRAM on a GPU reserved for holding intermediate values
+ * during a shader's execution. A user (usually ROCr) can allocate scratch
+ * memory by allocating VRAM memory and then using the
+ * AMDKFD_IOC_SET_SCRATCH_BACKING_VA ioctl.
+ */
+
+/**
+ * DOC: Memory_Implementation
+ *
+ * The GPU page tables need to be kept in sync with the CPU page tables; if a
+ * page is moved, swapped, or evicted by Linux's normal memory manager, a
+ * callback is made into KFD, which must pause hardware access to the memory
+ * while the operation is in progress.
+ *
+ * Compute shaders can cause thrashing if the total memory in use exceeds the
+ * GPU's or the system's memory limits. Because user command submission is via
+ * usermode queues, with no driver involvement, all memory must be physically
+ * resident at all times (this is different from the graphics approach, which
+ * can swap memory on and off the GPU as needed). KFD prevents overcommitment
+ * of memory by keeping an account of how much memory each process has
+ * allocated and refusing to allocate beyond a threshold.
+ */
+
 static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
 					struct kfd_process *p, void *data)
 {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f5853835f03a..76d1842c9333 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -37,6 +37,31 @@
 
 #define MQD_SIZE_ALIGNED 768
 
+/**
+ * DOC: Discovery
+ *
+ * There are two phases of initialization and topology discovery in KFD. The
+ * first, module_init, occurs when the module is loaded into the kernel (at
+ * boot or on modprobe). The second, device_init, occurs when Linux discovers
+ * a PCI device that is an AMD GPU (on boot or hotplug).
+ *
+ * module_init begins when the amdgpu driver is initialized (amdgpu_drv.c),
+ * which calls kfd_init() in kfd_module.c. At this time, the chardev is created
+ * so that ioctls can be submitted, and the topology is queried, creating the
+ * sysfs layout. Some AMD APUs make their topology information available
+ * through a BIOS structure called a CRAT table. If no CRAT table is found,
+ * KFD will construct one from the information available to it. Discrete GPUs
+ * are not discovered at this time; only CPUs and APUs. At this point, AMDGPU
+ * registers itself as a PCIe driver.
+ *
+ * device_init begins when Linux finds a device with a PCIe ID matching an
+ * entry that amdgpu is registered for. If the device contains compute
+ * functionality, amdgpu will call kgd2kfd_probe() and kgd2kfd_device_init()
+ * in kfd_device.c (kgd2kfd stands for Kernel Graphics Driver to Kernel Fusion
+ * Driver) to set up shared resources such as non-compute doorbells and add
+ * the new device to the topology.
+ */
+
 /*
  * kfd_locked is used to lock the kfd driver during suspend or reset
  * once locked, kfd driver will stop any further GPU execution.
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index 0f6992b1895c..3c1a2be18d4c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -25,6 +25,63 @@
 #include <linux/slab.h>
 #include "kfd_priv.h"
 
+/**
+ * DOC: Queue_Interface
+ *
+ * A process can create queues with the AMDKFD_IOC_CREATE_QUEUE ioctl, which
+ * returns a queue id used as a handle, the addresses of the read and write
+ * pointers, and the doorbell location. Up to 256 processes can have queues,
+ * and each process can have up to 1024 queues.
+ *
+ * A doorbell is a 64-bit memory-mapped register on the GPU that a process
+ * writes to in order to signal that the corresponding queue has packets
+ * waiting in it.
+ *
+ * A queue can be either a compute queue, used for computation, or an SDMA
+ * queue, used for data transfers.
+ *
+ * Most HSA queues take commands in the form of 64-byte AQL packets. Most
+ * commonly, this will be a kernel dispatch packet containing a pointer to the
+ * kernel to be executed. A kernel is a small program that performs an
+ * elementary operation such as a vector sum, matrix multiplication, or
+ * scatter-gather operation. A single user program may be made up of many
+ * kernels. Other AQL packets include barrier packets, used for synchronization
+ * between shaders, and PM4_IB packets that flush the cache, used for
+ * profiling. Packets in the same queue will begin execution in order, but can
+ * run concurrently. SDMA queues are similar, but use a different packet
+ * format.
+ *
+ * A queue contains a ringbuffer with read and write pointers used to submit
+ * packets (the size of the ringbuffer is specified when the queue is
+ * created). To write to a queue, a process first atomically moves the write
+ * pointer forward. Then, it writes each of the packets to the buffer, leaving
+ * the headers empty so that if hardware attempts to consume the packets at
+ * this point, it will find them invalid. Then it writes the headers and
+ * signals the doorbell.
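+ *
+ * As an illustrative sketch only (an editor's example; the queue, ring,
+ * header and doorbell names below are hypothetical placeholders, and the
+ * real AQL packet layout is defined by the HSA runtime rather than by KFD),
+ * publishing one 64-byte packet from userspace looks roughly like this::
+ *
+ *	// Reserve a slot by atomically advancing the write pointer.
+ *	uint64_t idx = __atomic_fetch_add(queue->write_ptr, 1,
+ *					  __ATOMIC_RELAXED);
+ *	struct aql_packet *slot = &queue->ring[idx % queue->ring_entries];
+ *
+ *	// Fill in the body while the header still marks the packet invalid,
+ *	// so hardware that reads ahead will not consume a half-written packet.
+ *	fill_packet_body(slot, &dispatch_args);
+ *
+ *	// Publish the packet by writing the real header last ...
+ *	__atomic_store_n(&slot->header, valid_header, __ATOMIC_RELEASE);
+ *
+ *	// ... then ring the doorbell to signal that work is waiting (the
+ *	// exact doorbell value convention varies by ASIC generation).
+ *	*(volatile uint64_t *)queue->doorbell = idx;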
+ *
+ * In addition to the user mode queues described here, there are kernel mode
+ * queues used internally by KFD to communicate with various elements of the
+ * hardware; these work similarly.
+ */
+
+/**
+ * DOC: Queue_Implementation
+ *
+ * Although there may be thousands of queues attached to processes, the
+ * hardware engines have a limited number of queue slots, usually 32 or fewer
+ * compute queues and 10 or fewer SDMA queues per GPU. The hardware will detect
+ * doorbell signals directly only from queues mapped to an engine. The hardware
+ * scheduler will periodically poll for unmapped queues with work waiting and
+ * map them, unmapping empty queues to make room.
+ *
+ * Compute shaders can be interrupted partway through execution by the hardware
+ * scheduler. In that case, the shader's current state will be saved to a
+ * usermode buffer so it can be restored at a later time. These buffers are
+ * large, and each queue requires its own buffer, so queues are memory-expensive
+ * objects. The context save/restore process is initiated by a trap handler on
+ * the GPU. The trap handler itself is located in the driver and is written in
+ * SP3 assembly code.
+ */
+
 void print_queue_properties(struct queue_properties *q)
 {
 	if (!q)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 25990bec600d..8c2910b98ece 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -42,6 +42,39 @@
 #include "amdgpu_ras.h"
 #include "amdgpu.h"
 
+/**
+ * DOC: Topology
+ *
+ * The GPU component of an APU or iGPU, or a discrete GPU, is a GPU device. The
+ * CPU is also a device (known as "System" or "Host" to KFD).
+ *
+ * A node is a memory domain. Most devices are a single node, but certain GPUs
+ * may contain multiple nodes, depending on how they are configured. Each GPU
+ * node has its own L2 data cache.
+ *
+ * A GPU contains multiple Shader Engines (SEs). Each Shader Engine has its own
+ * sub-scheduler to divide up work within the SE.
+ *
+ * A Shader Engine contains multiple Compute Units (CUs). All processing in a
+ * CU shares all caches, so two threads running in the same CU can easily
+ * communicate and synchronize.
+ *
+ * A Compute Unit contains multiple Single Instruction Multiple Data units
+ * (SIMDs). A SIMD can run programs that perform operations on 32- or
+ * 64-element vectors. A program running on a single SIMD is called a
+ * wavefront.
+ *
+ * In addition to the processing capabilities, the topology also includes the
+ * IO links between nodes. GPU nodes may be connected to each other or to the
+ * system via XGMI or PCIe links.
+ *
+ * Topology information is available through sysfs at
+ * /sys/devices/virtual/kfd/kfd/topology or through a symbolic link at
+ * /sys/class/kfd/kfd/topology. The generation_id field in that directory is
+ * incremented each time the topology is updated. To ensure a consistent view
+ * of the topology, user programs should read generation_id before and after
+ * checking the topology, and retry if the values are not the same.
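+ *
+ * An illustrative sketch of that retry pattern (an editor's example; the
+ * helpers read_generation_id() and read_topology() are hypothetical)::
+ *
+ *	uint64_t before, after;
+ *
+ *	do {
+ *		before = read_generation_id();	// .../topology/generation_id
+ *		read_topology();		// walk the sysfs nodes
+ *		after = read_generation_id();
+ *	} while (before != after);	// topology changed mid-read, so retry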
+ */
+
 /* topology_device_list - Master list of all topology devices */
 static struct list_head topology_device_list;
 static struct kfd_system_properties sys_props;
-- 
2.25.1