Thanks for all the input. Very helpful! I think I have a general understanding of the queue scheduling now, and it's time for me to read more code and materials and do some experiments. I'll come back with more questions, hopefully. :-)

Hi David, please don't hesitate to share more documents. I might find helpful information in them eventually. People like me may benefit from them in some way in the future.

Best,
Ming (Mark)

On Tue, Feb 13, 2018 at 7:14 PM, Panariti, David <David.Panariti at amd.com> wrote:
> I found a bunch of docs whilst spelunking for info for another project.
> I'm not sure what's up-to-date, correct, useful, etc.
> I've attached one.
> Let me know if you want any more.
>
> davep
>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Bridgman, John
>> Sent: Tuesday, February 13, 2018 6:45 PM
>> To: Bridgman, John <John.Bridgman at amd.com>; Ming Yang <minos.future at gmail.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: RE: Documentation about AMD's HSA implementation?
>>
>> >-----Original Message-----
>> >From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Bridgman, John
>> >Sent: Tuesday, February 13, 2018 6:42 PM
>> >To: Ming Yang; Kuehling, Felix
>> >Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>> >Subject: RE: Documentation about AMD's HSA implementation?
>> >
>> >>-----Original Message-----
>> >>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Ming Yang
>> >>Sent: Tuesday, February 13, 2018 4:59 PM
>> >>To: Kuehling, Felix
>> >>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>> >>Subject: Re: Documentation about AMD's HSA implementation?
>> >>
>> >>That's very helpful, thanks!
>> >>
>> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> >>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>> >>>> Thanks for the suggestions! But I might ask several specific questions, as I can't find the answers in those documents, to give myself a quick start, if that's okay. Pointing me to the files/functions would be good enough. Any explanations are appreciated. My purpose is to hack it with a different scheduling policy, with real-time and predictability considerations.
>> >>>>
>> >>>> - Where/how is the packet scheduler implemented? How are packets from multiple queues scheduled? What about scheduling packets from queues in different address spaces?
>> >>>
>> >>> This is done mostly in firmware. The CP engine supports up to 32 queues. We share those between KFD and AMDGPU. KFD gets 24 queues to use, usually 6 queues times 4 pipes. Pipes are threads in the CP micro engine. Within each pipe the queues are time-multiplexed.
>> >>
>> >>Please correct me if I'm wrong. The CP is a compute processor, like the Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp) scheduler multiplexing queues in order to hide memory latency.
>> >
>> >CP is one step back from that - it's a "command processor" which reads command packets from the driver (PM4 format) or the application (AQL format) and then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw", which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by the SQ, IIRC). Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".
>>
>> I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks.
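As a concrete illustration of the PM4 packets described above, here is a minimal user-space sketch of how a compute dispatch packet is laid out, loosely following the public amdgpu PACKET3 macros. The field layout and opcode value should be treated as approximate, and the helper names are invented for illustration; this is not the actual driver code.

#include <stdint.h>
#include <stdio.h>

/* Simplified PM4 type-3 packet header, loosely following the public
 * amdgpu PACKET3() macro; layout is approximate and for illustration. */
#define PM4_TYPE3(op, count) ((3u << 30) | (((count) & 0x3FFFu) << 16) | (((op) & 0xFFu) << 8))
#define PM4_OP_DISPATCH_DIRECT 0x15   /* dispatch opcode as in the public headers */

/* Build a compute dispatch packet: header plus X/Y/Z workgroup counts
 * and a dispatch-initiator word.  Returns the number of dwords written. */
static int build_dispatch(uint32_t *buf, uint32_t x, uint32_t y, uint32_t z,
                          uint32_t initiator)
{
    int n = 0;
    buf[n++] = PM4_TYPE3(PM4_OP_DISPATCH_DIRECT, 3); /* 4 payload dwords */
    buf[n++] = x;
    buf[n++] = y;
    buf[n++] = z;
    buf[n++] = initiator;
    return n;
}

int main(void)
{
    uint32_t ib[8];
    int n = build_dispatch(ib, 16, 16, 1, 0x1);
    for (int i = 0; i < n; i++)
        printf("dw%d: 0x%08x\n", i, ib[i]);
    return 0;
}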
>> >>>
>> >>> If we need more than 24 queues, or if we have more than 8 processes, the hardware scheduler (HWS) adds another layer of scheduling, basically round-robin between batches of 24 queues or 8 processes. Once you get into such an over-subscribed scenario your performance and GPU utilization can suffer quite badly.
>> >>
>> >>Is the HWS also implemented in firmware that's closed-source?
>> >
>> >Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however.
>> >
>> >>>> - I noticed the new support for multi-process concurrency in the archive of this mailing list. Could you point me to the code that implements it?
>> >>>
>> >>> That's basically just a switch that tells the firmware that it is allowed to schedule queues from different processes at the same time. The upper limit is the number of VMIDs that HWS can work with. It needs to assign a unique VMID to each process (each VMID representing a separate address space, page table, etc.). If there are more processes than VMIDs, the HWS has to time-multiplex.
>> >>
>> >>Does the HWS dispatch packets in the order they become the head of the queue, i.e., as pointed to by the read_index? In that case it's FIFO. Or is it round-robin between queues? You mentioned round-robin over batches in the over-subscribed scenario.
>> >
>> >Round robin between sets of queues. The HWS logic generates sets as follows:
>> >
>> >1. A "set resources" packet from the driver tells the scheduler how many VMIDs and HW queues it can use.
>> >
>> >2. A "runlist" packet from the driver provides the list of processes and the list of queues for each process.
>> >
>> >3. If the multi-process switch is not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1).
>> >
>> >4. At the end of the process quantum (set by the driver), either switch to the next process (if all queues from the first process have been scheduled) or schedule the next set of queues from the same process.
>> >
>> >5. When all queues from all processes have been scheduled and run for a process quantum, go back to the start of the runlist and repeat.
>> >
>> >If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set.
>> >
>> >>This might not be a big deal for performance, but it matters for predictability and real-time analysis.
>> >
>> >Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real time you would probably want to limit to a single queue per pipe, since the MEC also multiplexes between HW queues on a pipe even without HWS.
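As a sanity check on my understanding of the set-building steps described above, here is a small stand-alone simulation of that round-robin-over-sets behaviour. It is only a rough sketch with made-up limits, names and runlist contents, not the firmware or driver logic.

#include <stdio.h>

#define NUM_HW_QUEUES 24   /* from the "set resources" packet */
#define NUM_VMIDS      8
#define NUM_PROCS      4

struct proc { const char *name; int num_queues; };

/* Walk the runlist and group queues into sets; each set runs for one
 * process quantum before the scheduler moves to the next set. */
static void build_sets(const struct proc *runlist, int n, int multi_process)
{
    int p = 0, q = 0, set = 0;
    while (p < n) {
        int hwq = 0, vmid = 0;
        printf("set %d:", set++);
        while (p < n && hwq < NUM_HW_QUEUES && vmid < NUM_VMIDS) {
            /* schedule as many queues from the current process as fit */
            while (q < runlist[p].num_queues && hwq < NUM_HW_QUEUES) {
                printf(" %s.q%d", runlist[p].name, q);
                q++; hwq++;
            }
            if (q < runlist[p].num_queues)
                break;              /* ran out of HW queues mid-process */
            p++; q = 0; vmid++;     /* next process needs its own VMID */
            if (!multi_process)
                break;              /* only one process per set */
        }
        printf("\n");
    }
}

int main(void)
{
    const struct proc runlist[NUM_PROCS] = {
        { "A", 3 }, { "B", 30 }, { "C", 2 }, { "D", 5 },
    };
    printf("multi-process switch off:\n");
    build_sets(runlist, NUM_PROCS, 0);
    printf("multi-process switch on:\n");
    build_sets(runlist, NUM_PROCS, 1);
    return 0;
}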
>> >
>> >>>> - Also a related question -- where/how is the preemption/context switch between packets/queues implemented?
>> >>>
>> >>> As long as you don't oversubscribe the available VMIDs, there is no real context switching. Everything can run concurrently. When you start oversubscribing HW queues or VMIDs, the HWS firmware will start multiplexing. This is all handled inside the firmware and is quite transparent even to KFD.
>> >>
>> >>I see. So preemption, at least in AMD's implementation, does not switch out the executing kernel; it just lets new kernels run concurrently with the existing ones. This means performance degrades when too many workloads are submitted. The running kernels leave the GPU only when they are done.
>> >
>> >Both - you can have multiple kernels executing concurrently (each generating multiple threads in the shader core) AND switch out the currently executing set of kernels via preemption.
>> >
>> >>Is there any reason for not preempting/switching out the existing kernel, besides context-switch overhead? NVIDIA does not provide this option either. Non-preemption hurts real-time properties in terms of priority inversion. I understand preemption should not be used heavily, but having such an option may help a lot for real-time systems.
>> >
>> >If I understand you correctly, you can have it either way depending on the number of queues you enable simultaneously. At any given time you are typically only going to be running the kernels from one queue on each pipe, i.e. with 3 pipes and 24 queues you would typically only be running 3 kernels at a time. This seemed like a good compromise between scalability and efficiency.
>> >
>> >>>
>> >>> KFD interacts with the HWS firmware through the HIQ (HSA interface queue). It supports packets for unmapping queues, and we can send it a new runlist (basically a bunch of map-process and map-queue packets). The interesting files to look at are kfd_packet_manager.c, kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>> >>
>> >>So in this way, if we want to implement a different scheduling policy, we should control the submission of packets to the queues in the runtime/KFD, before they get to the firmware, because they are out of reach once they are submitted to the HWS in the firmware.
>> >
>> >Correct - there is a tradeoff between "easily scheduling lots of work" and fine-grained control. Limiting the number of queues you run simultaneously is another way of taking back control.
>> >
>> >You're probably past this, but you might find the original introduction to KFD useful in some way:
>> >
>> >https://lwn.net/Articles/605153/
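To picture what "a runlist is basically a bunch of map-process and map-queue packets" means in practice, here is a rough sketch of building such a buffer. The struct layouts and field names below are invented for illustration only; the real packet definitions and the code that emits them live in kfd_packet_manager.c and the PM4 headers next to it.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Invented, simplified packet shapes; the real ones are the map-process
 * and map-queues PM4 packets emitted by kfd_packet_manager.c. */
struct map_process_pkt { uint32_t header; uint32_t pasid; uint64_t page_table_base; };
struct map_queue_pkt   { uint32_t header; uint32_t pasid; uint64_t ring_base; uint64_t doorbell; };

struct queue   { uint64_t ring_base; uint64_t doorbell; };
struct process { uint32_t pasid; uint64_t page_table_base; int nqueues; const struct queue *queues; };

/* Emit one map-process packet per process, followed by one map-queue
 * packet for each of its queues; the resulting buffer is what gets sent
 * to the HWS through the HIQ as a runlist.  Returns bytes written. */
static size_t build_runlist(uint8_t *ib, const struct process *procs, int nprocs)
{
    size_t off = 0;
    for (int p = 0; p < nprocs; p++) {
        struct map_process_pkt mp = { 0 /* opcode+size would go here */,
                                      procs[p].pasid, procs[p].page_table_base };
        memcpy(ib + off, &mp, sizeof(mp));
        off += sizeof(mp);
        for (int q = 0; q < procs[p].nqueues; q++) {
            struct map_queue_pkt mq = { 0, procs[p].pasid,
                                        procs[p].queues[q].ring_base,
                                        procs[p].queues[q].doorbell };
            memcpy(ib + off, &mq, sizeof(mq));
            off += sizeof(mq);
        }
    }
    return off;
}

int main(void)
{
    static const struct queue qs[2] = { { 0x1000, 0x0 }, { 0x2000, 0x8 } };
    const struct process procs[1] = { { 42, 0x100000, 2, qs } };
    uint8_t ib[256];
    printf("runlist size: %zu bytes\n", build_runlist(ib, procs, 1));
    return 0;
}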
>> >>
>> >>Best,
>> >>Mark
>> >>
>> >>> Regards,
>> >>> Felix
>> >>>
>> >>>> Thanks in advance!
>> >>>>
>> >>>> Best,
>> >>>> Mark
>> >>>>
>> >>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> >>>>> There is also this: https://gpuopen.com/professional-compute/, which gives pointers to several libraries and tools built on top of ROCm.
>> >>>>>
>> >>>>> Another thing to keep in mind is that ROCm is diverging from the strict HSA standard in some important ways. For example, the HSA standard includes HSAIL as an intermediate representation that gets finalized on the target system, whereas ROCm compiles directly to native GPU ISA.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Felix
>> >>>>>
>> >>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher at amd.com> wrote:
>> >>>>>> The ROCm documentation is probably a good place to start:
>> >>>>>>
>> >>>>>> https://rocm.github.io/documentation.html
>> >>>>>>
>> >>>>>> Alex
>> >>>>>>
>> >>>>>> ________________________________
>> >>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Ming Yang <minos.future at gmail.com>
>> >>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>> >>>>>> To: amd-gfx at lists.freedesktop.org
>> >>>>>> Subject: Documentation about AMD's HSA implementation?
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm interested in HSA and was excited when I found AMD's fully open ROCm stack supporting it. Before digging into the code, I wonder if there is any documentation available about AMD's HSA implementation, whether a book, whitepaper, paper, or other documentation.
>> >>>>>>
>> >>>>>> I did find helpful materials about HSA, including the HSA standards on this page (http://www.hsafoundation.com/standards/) and a nice book about HSA (Heterogeneous System Architecture: A New Compute Platform Infrastructure). But regarding documentation about AMD's implementation, I haven't found anything yet.
>> >>>>>>
>> >>>>>> Please let me know if there are any that are publicly accessible. If not, I would appreciate any suggestions on learning the implementation of specific system components, e.g., queue scheduling.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Mark
>> >>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx