>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Bridgman, John
>Sent: Tuesday, February 13, 2018 6:42 PM
>To: Ming Yang; Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>Subject: RE: Documentation about AMD's HSA implementation?
>
>>-----Original Message-----
>>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Ming Yang
>>Sent: Tuesday, February 13, 2018 4:59 PM
>>To: Kuehling, Felix
>>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>That's very helpful, thanks!
>>
>>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>>> Thanks for the suggestions! But may I ask several specific questions to give myself a quick start, as I can't find the answers in those documents? Pointing me to the files/functions would be good enough. Any explanations are appreciated. My purpose is to hack it with a different scheduling policy, with real-time and predictability considerations.
>>>>
>>>> - Where/how is the packet scheduler implemented? How are packets from multiple queues scheduled? What about scheduling packets from queues in different address spaces?
>>>
>>> This is done mostly in firmware. The CP engine supports up to 32 queues. We share those between KFD and AMDGPU. KFD gets 24 queues to use, usually 6 queues times 4 pipes. Pipes are threads in the CP micro engine. Within each pipe the queues are time-multiplexed.
>>
>>Please correct me if I'm wrong: the CP is the compute processor, like the Execution Engine in an NVIDIA GPU, and a pipe is like a wavefront (warp) scheduler, multiplexing queues in order to hide memory latency?
>
>CP is one step back from that - it's a "command processor" which reads command packets from the driver (PM4 format) or the application (AQL format), then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw", which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by the SQ, IIRC). Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".

I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks.
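To make the 6-queues-times-4-pipes split above concrete, here is a small stand-alone C sketch. It is purely illustrative - the names and the exact queue-to-pipe mapping are mine, not the actual KFD code, although KFD does something similar when it carves up the MEC's queue slots:

    #include <stdio.h>

    #define NUM_PIPES       4   /* threads in the CP compute micro engine */
    #define QUEUES_PER_PIPE 6   /* HW queue (HQD) slots per pipe given to KFD */

    /*
     * Spread consecutive queue IDs round-robin across pipes, so that
     * concurrently active queues tend to land on different pipes and can
     * run in parallel instead of being time-multiplexed on one pipe.
     */
    static void slot_for(unsigned qid, unsigned *pipe, unsigned *hqd)
    {
        *pipe = qid % NUM_PIPES;
        *hqd  = qid / NUM_PIPES;
    }

    int main(void)
    {
        unsigned qid, pipe, hqd;

        for (qid = 0; qid < NUM_PIPES * QUEUES_PER_PIPE; qid++) {
            slot_for(qid, &pipe, &hqd);
            printf("queue %2u -> pipe %u, HQD slot %u\n", qid, pipe, hqd);
        }
        return 0;
    }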
>>
>>>
>>> If we need more than 24 queues, or if we have more than 8 processes, the hardware scheduler (HWS) adds another layer of scheduling, basically round-robin between batches of 24 queues or 8 processes. Once you get into such an over-subscribed scenario your performance and GPU utilization can suffer quite badly.
>>
>>Is HWS also implemented in the closed-source firmware?
>
>Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however.
>>
>>>
>>>> - I noticed the new support for concurrency of multiple processes in the archive of this mailing list. Could you point me to the code that implements this?
>>>
>>> That's basically just a switch that tells the firmware that it is allowed to schedule queues from different processes at the same time. The upper limit is the number of VMIDs that HWS can work with. It needs to assign a unique VMID to each process (each VMID representing a separate address space, page table, etc.). If there are more processes than VMIDs, the HWS has to time-multiplex.
>>
>>Does HWS dispatch packets in the order they become the head of the queue, i.e., as pointed to by the read_index? In that case it would be FIFO. Or is it round-robin between queues? You mentioned round-robin over batches in the over-subscribed scenario.
>
>Round robin between sets of queues. The HWS logic generates sets as follows:
>
>1. A "set resources" packet from the driver tells the scheduler how many VMIDs and HW queues it can use.
>
>2. A "runlist" packet from the driver provides a list of processes and a list of queues for each process.
>
>3. If the multi-process switch is not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1).
>
>4. At the end of the process quantum (set by the driver), it either switches to the next process (if all queues from the first process have been scheduled) or schedules the next set of queues from the same process.
>
>5. When all queues from all processes have been scheduled and have run for a process quantum, it goes back to the start of the runlist and repeats.
>
>If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set.
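To spell out steps 1-5, here is a schematic C simulation of that set-building loop. This is entirely my own pseudocode of the behaviour described above - the real logic lives in the closed MEC microcode - with the 24-queue/8-VMID numbers taken from earlier in this thread:

    #include <stdio.h>

    #define HW_QUEUES 24  /* granted by the "set resources" packet (step 1) */
    #define NUM_VMIDS  8  /* likewise granted by "set resources" */

    /* One runlist entry: a process and how many queues it lists (step 2). */
    struct process {
        const char *name;
        int num_queues;   /* assumed > 0 */
    };

    /* Build and "run" one set of queues per process quantum (steps 3-5). */
    static void run_hws(const struct process *rl, int n, int multi_process,
                        int quanta)
    {
        int p = 0, q = 0;  /* cursor: current process, next queue within it */

        while (quanta-- > 0) {
            int free_hw = HW_QUEUES, free_vmid = NUM_VMIDS;

            printf("set:");
            while (free_hw > 0 && free_vmid > 0) {
                int take = rl[p].num_queues - q;

                if (take > free_hw)
                    take = free_hw;      /* clip to remaining HW queues */
                printf(" %s[q%d..q%d]", rl[p].name, q, q + take - 1);
                free_hw -= take;
                free_vmid--;             /* one VMID per process in a set */
                q += take;
                if (q < rl[p].num_queues)
                    break;  /* big process: its remaining queues form the
                             * next set (step 4, second alternative) */
                q = 0;
                p = (p + 1) % n;
                if (p == 0)
                    break;  /* wrapped: end of runlist reached (step 5) */
                if (!multi_process)
                    break;  /* one process per set without the switch */
            }
            printf("  -> runs for one process quantum\n");
        }
    }

    int main(void)
    {
        /* Hypothetical runlist: process A oversubscribes the 24 HW queues. */
        const struct process runlist[] = { {"A", 30}, {"B", 4}, {"C", 2} };

        run_hws(runlist, 3, /*multi_process=*/1, /*quanta=*/4);
        return 0;
    }

With the multi-process switch set, this prints alternating sets: first A's 24 queues alone, then A's remaining 6 queues together with B's and C's under separate VMIDs, then around again.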
>>
>>This might not be a big deal for performance, but it matters for predictability and real-time analysis.
>
>Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real-time you would probably want to limit yourself to a single queue per pipe, since the MEC also multiplexes between HW queues on a pipe even without HWS.
>
>>
>>>
>>>> - Also, a related question: where/how is the preemption/context switch between packets/queues implemented?
>>>
>>> As long as you don't oversubscribe the available VMIDs, there is no real context switching. Everything can run concurrently. When you start oversubscribing HW queues or VMIDs, the HWS firmware will start multiplexing. This is all handled inside the firmware and is quite transparent even to KFD.
>>
>>I see. So preemption, in at least AMD's implementation, is not switching out the executing kernel, but just letting new kernels run concurrently with the existing ones. This means performance degrades when too many workloads are submitted. The running kernels leave the GPU only when they are done.
>
>Both - you can have multiple kernels executing concurrently (each generating multiple threads in the shader core) AND switch out the currently executing set of kernels via preemption.
>
>>
>>Is there any reason for not preempting/switching out the existing kernel, besides context-switch overheads? NVIDIA does not provide this option either. Non-preemption hurts the real-time property in terms of priority inversion. I understand preemption should not be used heavily, but having such an option may help a lot for real-time systems.
>
>If I understand you correctly, you can have it either way depending on the number of queues you enable simultaneously. At any given time you are typically only going to be running the kernels from one queue on each pipe, i.e., with 3 pipes and 24 queues you would typically only be running 3 kernels at a time. This seemed like a good compromise between scalability and efficiency.
>
>>
>>>
>>> KFD interacts with the HWS firmware through the HIQ (HSA interface queue). It supports packets for unmapping queues, and we can send it a new runlist (basically a bunch of map-process and map-queue packets). The interesting files to look at are kfd_packet_manager.c, kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>
>>So if we want to implement a different scheduling policy, we should control the submission of packets to the queues in the runtime/KFD, before they reach the firmware, because they are out of reach once submitted to the HWS in the firmware.
>
>Correct - there is a tradeoff between "easily scheduling lots of work" and fine-grained control. Limiting the number of queues you run simultaneously is another way of taking back control.
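To make that concrete, here is a rough sketch of a runlist being assembled from map-process and map-queues packets, which is essentially what the driver hands to HWS via the HIQ. All struct layouts, field names and header values below are invented for illustration - the real PM4 packet definitions live alongside kfd_packet_manager.c in the KFD sources:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Invented, heavily simplified stand-ins for the PM4 packets that
     * the packet manager emits into the runlist buffer. */
    struct pm4_map_process {
        uint32_t header;          /* opcode + packet size (fake value here) */
        uint32_t pasid;           /* process address space ID */
        uint64_t page_table_base; /* per-process VM context */
    };

    struct pm4_map_queues {
        uint32_t header;
        uint32_t doorbell_offset;
        uint64_t mqd_addr;        /* memory queue descriptor for this queue */
        uint64_t wptr_addr;       /* where CP polls the write pointer */
    };

    /* Append one process and its queues to the runlist buffer. */
    static size_t emit_process(uint8_t *buf, size_t off, uint32_t pasid,
                               const uint64_t *mqds, int nqueues)
    {
        struct pm4_map_process mp = { .header = 0xDEAD0001, .pasid = pasid };

        memcpy(buf + off, &mp, sizeof(mp));
        off += sizeof(mp);

        for (int i = 0; i < nqueues; i++) {
            struct pm4_map_queues mq = {
                .header   = 0xDEAD0002,
                .mqd_addr = mqds[i],
            };
            memcpy(buf + off, &mq, sizeof(mq));
            off += sizeof(mq);
        }
        return off;
    }

    int main(void)
    {
        uint8_t runlist[4096];
        uint64_t mqds_a[] = { 0x1000, 0x2000 }, mqds_b[] = { 0x3000 };
        size_t off = 0;

        off = emit_process(runlist, off, 1, mqds_a, 2);
        off = emit_process(runlist, off, 2, mqds_b, 1);
        printf("runlist: %zu bytes for 2 processes\n", off);
        /* KFD would now point HWS at this buffer with a runlist packet
         * on the HIQ; a custom scheduling policy would decide here which
         * processes/queues go into the runlist, and in what order. */
        return 0;
    }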
>You're probably past this, but you might find the original introduction to KFD useful in some way:
>
>https://lwn.net/Articles/605153/
>
>>
>>Best,
>>Mark
>>
>>> Regards,
>>>   Felix
>>>
>>>>
>>>> Thanks in advance!
>>>>
>>>> Best,
>>>> Mark
>>>>
>>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>>>>> There is also this: https://gpuopen.com/professional-compute/, which gives pointers to several libraries and tools built on top of ROCm.
>>>>>
>>>>> Another thing to keep in mind is that ROCm diverges from the strict HSA standard in some important ways. For example, the HSA standard includes HSAIL as an intermediate representation that gets finalized on the target system, whereas ROCm compiles directly to native GPU ISA.
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher at amd.com> wrote:
>>>>>> The ROCm documentation is probably a good place to start:
>>>>>>
>>>>>> https://rocm.github.io/documentation.html
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ________________________________
>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Ming Yang <minos.future at gmail.com>
>>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm interested in HSA and was excited when I found AMD's fully open ROCm stack supporting it. Before digging into the code, I wonder if there's any documentation available about AMD's HSA implementation - a book, whitepaper, paper, or other documentation.
>>>>>>
>>>>>> I did find helpful materials about HSA in general, including the HSA standards on this page (http://www.hsafoundation.com/standards/) and a nice book about HSA (Heterogeneous System Architecture: A New Compute Platform Infrastructure). But regarding documentation of AMD's implementation, I haven't found anything yet.
>>>>>>
>>>>>> Please let me know if any are publicly accessible. If not, any suggestions on learning the implementation of specific system components, e.g., queue scheduling, would be appreciated.
>>>>>>
>>>>>> Best,
>>>>>> Mark
>_______________________________________________
>amd-gfx mailing list
>amd-gfx at lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx