Thanks for all the input. Very helpful! I think I have a general understanding of the queue scheduling now, and it's time for me to read more code and materials and do some experiments. I'll come back with more questions, hopefully. :-)

Hi David, please don't hesitate to share more documents. I might find helpful information in them eventually. People like me may benefit from them in some way in the future.

Best,
Ming (Mark)

On Tue, Feb 13, 2018 at 7:14 PM, Panariti, David <David.Panariti at amd.com> wrote:
> I found a bunch of docs whilst spelunking for info for another project.
> I'm not sure what's up-to-date, correct, useful, etc.
> I've attached one.
> Let me know if you want any more.
>
> davep
>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Bridgman, John
>> Sent: Tuesday, February 13, 2018 6:45 PM
>> To: Bridgman, John <John.Bridgman at amd.com>; Ming Yang <minos.future at gmail.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: RE: Documentation about AMD's HSA implementation?
>>
>> >-----Original Message-----
>> >From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Bridgman, John
>> >Sent: Tuesday, February 13, 2018 6:42 PM
>> >To: Ming Yang; Kuehling, Felix
>> >Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>> >Subject: RE: Documentation about AMD's HSA implementation?
>> >
>> >>-----Original Message-----
>> >>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Ming Yang
>> >>Sent: Tuesday, February 13, 2018 4:59 PM
>> >>To: Kuehling, Felix
>> >>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>> >>Subject: Re: Documentation about AMD's HSA implementation?
>> >>
>> >>That's very helpful, thanks!
>> >>
>> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> >>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>> >>>> Thanks for the suggestions! But I might ask several specific questions, as I can't find the answers in those documents, to give myself a quick start, if that's okay. Pointing me to the files/functions would be good enough. Any explanations are appreciated. My purpose is to hack it with a different scheduling policy, with real-time and predictability considerations.
>> >>>>
>> >>>> - Where/how is the packet scheduler implemented? How are packets from multiple queues scheduled? What about scheduling packets from queues in different address spaces?
>> >>>
>> >>> This is done mostly in firmware. The CP engine supports up to 32 queues. We share those between KFD and AMDGPU. KFD gets 24 queues to use, usually 6 queues times 4 pipes. Pipes are threads in the CP micro engine. Within each pipe the queues are time-multiplexed.
>> >>
>> >>Please correct me if I'm wrong. The CP is a compute processor, like the Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp) scheduler multiplexing queues in order to hide memory latency.
>> >
>> >CP is one step back from that - it's a "command processor" which reads command packets from the driver (PM4 format) or the application (AQL format) and then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw", which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by the SQ, IIRC). Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".
>>
>> I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks.
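As a concrete illustration of the PM4 packets described above, here is a minimal user-space sketch of how a compute dispatch packet is laid out, loosely following the public amdgpu PACKET3 macros. The field layout and opcode value should be treated as approximate, and the helper names are invented for illustration; this is not the actual driver code.

#include <stdint.h>
#include <stdio.h>

/* Simplified PM4 type-3 packet header, loosely following the public
 * amdgpu PACKET3() macro; layout is approximate and for illustration. */
#define PM4_TYPE3(op, count) ((3u << 30) | (((count) & 0x3FFFu) << 16) | (((op) & 0xFFu) << 8))
#define PM4_OP_DISPATCH_DIRECT 0x15   /* dispatch opcode as in the public headers */

/* Build a compute dispatch packet: header plus X/Y/Z workgroup counts
 * and a dispatch-initiator word.  Returns the number of dwords written. */
static int build_dispatch(uint32_t *buf, uint32_t x, uint32_t y, uint32_t z,
                          uint32_t initiator)
{
    int n = 0;
    buf[n++] = PM4_TYPE3(PM4_OP_DISPATCH_DIRECT, 3); /* 4 payload dwords */
    buf[n++] = x;
    buf[n++] = y;
    buf[n++] = z;
    buf[n++] = initiator;
    return n;
}

int main(void)
{
    uint32_t ib[8];
    int n = build_dispatch(ib, 16, 16, 1, 0x1);
    for (int i = 0; i < n; i++)
        printf("dw%d: 0x%08x\n", i, ib[i]);
    return 0;
}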
>> >>>
>> >>> If we need more than 24 queues, or if we have more than 8 processes, the hardware scheduler (HWS) adds another layer of scheduling, basically round-robin between batches of 24 queues or 8 processes. Once you get into such an over-subscribed scenario your performance and GPU utilization can suffer quite badly.
>> >>
>> >>Is the HWS also implemented in firmware that's closed-source?
>> >
>> >Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however.
>> >
>> >>>> - I noticed the new support for multi-process concurrency in the archive of this mailing list. Could you point me to the code that implements it?
>> >>>
>> >>> That's basically just a switch that tells the firmware that it is allowed to schedule queues from different processes at the same time. The upper limit is the number of VMIDs that HWS can work with. It needs to assign a unique VMID to each process (each VMID representing a separate address space, page table, etc.). If there are more processes than VMIDs, the HWS has to time-multiplex.
>> >>
>> >>Does the HWS dispatch packets in the order they become the head of the queue, i.e., as pointed to by the read_index? In that case it's FIFO. Or is it round-robin between queues? You mentioned round-robin over batches in the over-subscribed scenario.
>> >
>> >Round robin between sets of queues. The HWS logic generates sets as follows:
>> >
>> >1. A "set resources" packet from the driver tells the scheduler how many VMIDs and HW queues it can use.
>> >
>> >2. A "runlist" packet from the driver provides the list of processes and the list of queues for each process.
>> >
>> >3. If the multi-process switch is not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1).
>> >
>> >4. At the end of the process quantum (set by the driver), either switch to the next process (if all queues from the first process have been scheduled) or schedule the next set of queues from the same process.
>> >
>> >5. When all queues from all processes have been scheduled and run for a process quantum, go back to the start of the runlist and repeat.
>> >
>> >If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set.
>> >
>> >>This might not be a big deal for performance, but it matters for predictability and real-time analysis.
>> >
>> >Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real time you would probably want to limit to a single queue per pipe, since the MEC also multiplexes between HW queues on a pipe even without HWS.
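As a sanity check on my understanding of the set-building steps described above, here is a small stand-alone simulation of that round-robin-over-sets behaviour. It is only a rough sketch with made-up limits, names and runlist contents, not the firmware or driver logic.

#include <stdio.h>

#define NUM_HW_QUEUES 24   /* from the "set resources" packet */
#define NUM_VMIDS      8
#define NUM_PROCS      4

struct proc { const char *name; int num_queues; };

/* Walk the runlist and group queues into sets; each set runs for one
 * process quantum before the scheduler moves to the next set. */
static void build_sets(const struct proc *runlist, int n, int multi_process)
{
    int p = 0, q = 0, set = 0;
    while (p < n) {
        int hwq = 0, vmid = 0;
        printf("set %d:", set++);
        while (p < n && hwq < NUM_HW_QUEUES && vmid < NUM_VMIDS) {
            /* schedule as many queues from the current process as fit */
            while (q < runlist[p].num_queues && hwq < NUM_HW_QUEUES) {
                printf(" %s.q%d", runlist[p].name, q);
                q++; hwq++;
            }
            if (q < runlist[p].num_queues)
                break;              /* ran out of HW queues mid-process */
            p++; q = 0; vmid++;     /* next process needs its own VMID */
            if (!multi_process)
                break;              /* only one process per set */
        }
        printf("\n");
    }
}

int main(void)
{
    const struct proc runlist[NUM_PROCS] = {
        { "A", 3 }, { "B", 30 }, { "C", 2 }, { "D", 5 },
    };
    printf("multi-process switch off:\n");
    build_sets(runlist, NUM_PROCS, 0);
    printf("multi-process switch on:\n");
    build_sets(runlist, NUM_PROCS, 1);
    return 0;
}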
>> >
>> >>>> - Also a related question -- where/how is the preemption/context switch between packets/queues implemented?
>> >>>
>> >>> As long as you don't oversubscribe the available VMIDs, there is no real context switching. Everything can run concurrently. When you start oversubscribing HW queues or VMIDs, the HWS firmware will start multiplexing. This is all handled inside the firmware and is quite transparent even to KFD.
>> >>
>> >>I see. So preemption, at least in AMD's implementation, does not switch out the executing kernel; it just lets new kernels run concurrently with the existing ones. This means performance degrades when too many workloads are submitted. The running kernels leave the GPU only when they are done.
>> >
>> >Both - you can have multiple kernels executing concurrently (each generating multiple threads in the shader core) AND switch out the currently executing set of kernels via preemption.
>> >
>> >>Is there any reason for not preempting/switching out the existing kernel, besides context-switch overhead? NVIDIA does not provide this option either. Non-preemption hurts real-time properties in terms of priority inversion. I understand preemption should not be used heavily, but having such an option may help a lot for real-time systems.
>> >
>> >If I understand you correctly, you can have it either way depending on the number of queues you enable simultaneously. At any given time you are typically only going to be running the kernels from one queue on each pipe, i.e. with 3 pipes and 24 queues you would typically only be running 3 kernels at a time. This seemed like a good compromise between scalability and efficiency.
>> >
>> >>>
>> >>> KFD interacts with the HWS firmware through the HIQ (HSA interface queue). It supports packets for unmapping queues, and we can send it a new runlist (basically a bunch of map-process and map-queue packets). The interesting files to look at are kfd_packet_manager.c, kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>> >>
>> >>So in this way, if we want to implement a different scheduling policy, we should control the submission of packets to the queues in the runtime/KFD, before they get to the firmware, because they are out of reach once they are submitted to the HWS in the firmware.
>> >
>> >Correct - there is a tradeoff between "easily scheduling lots of work" and fine-grained control. Limiting the number of queues you run simultaneously is another way of taking back control.
>> >
>> >You're probably past this, but you might find the original introduction to KFD useful in some way:
>> >
>> >https://lwn.net/Articles/605153/
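To picture what "a runlist is basically a bunch of map-process and map-queue packets" means in practice, here is a rough sketch of building such a buffer. The struct layouts and field names below are invented for illustration only; the real packet definitions and the code that emits them live in kfd_packet_manager.c and the PM4 headers next to it.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Invented, simplified packet shapes; the real ones are the map-process
 * and map-queues PM4 packets emitted by kfd_packet_manager.c. */
struct map_process_pkt { uint32_t header; uint32_t pasid; uint64_t page_table_base; };
struct map_queue_pkt   { uint32_t header; uint32_t pasid; uint64_t ring_base; uint64_t doorbell; };

struct queue   { uint64_t ring_base; uint64_t doorbell; };
struct process { uint32_t pasid; uint64_t page_table_base; int nqueues; const struct queue *queues; };

/* Emit one map-process packet per process, followed by one map-queue
 * packet for each of its queues; the resulting buffer is what gets sent
 * to the HWS through the HIQ as a runlist.  Returns bytes written. */
static size_t build_runlist(uint8_t *ib, const struct process *procs, int nprocs)
{
    size_t off = 0;
    for (int p = 0; p < nprocs; p++) {
        struct map_process_pkt mp = { 0 /* opcode+size would go here */,
                                      procs[p].pasid, procs[p].page_table_base };
        memcpy(ib + off, &mp, sizeof(mp));
        off += sizeof(mp);
        for (int q = 0; q < procs[p].nqueues; q++) {
            struct map_queue_pkt mq = { 0, procs[p].pasid,
                                        procs[p].queues[q].ring_base,
                                        procs[p].queues[q].doorbell };
            memcpy(ib + off, &mq, sizeof(mq));
            off += sizeof(mq);
        }
    }
    return off;
}

int main(void)
{
    static const struct queue qs[2] = { { 0x1000, 0x0 }, { 0x2000, 0x8 } };
    const struct process procs[1] = { { 42, 0x100000, 2, qs } };
    uint8_t ib[256];
    printf("runlist size: %zu bytes\n", build_runlist(ib, procs, 1));
    return 0;
}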
>> >>
>> >>Best,
>> >>Mark
>> >>
>> >>> Regards,
>> >>> Felix
>> >>>
>> >>>> Thanks in advance!
>> >>>>
>> >>>> Best,
>> >>>> Mark
>> >>>>
>> >>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> >>>>> There is also this: https://gpuopen.com/professional-compute/, which gives pointers to several libraries and tools built on top of ROCm.
>> >>>>>
>> >>>>> Another thing to keep in mind is that ROCm is diverging from the strict HSA standard in some important ways. For example, the HSA standard includes HSAIL as an intermediate representation that gets finalized on the target system, whereas ROCm compiles directly to native GPU ISA.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Felix
>> >>>>>
>> >>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher at amd.com> wrote:
>> >>>>>> The ROCm documentation is probably a good place to start:
>> >>>>>>
>> >>>>>> https://rocm.github.io/documentation.html
>> >>>>>>
>> >>>>>> Alex
>> >>>>>>
>> >>>>>> ________________________________
>> >>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Ming Yang <minos.future at gmail.com>
>> >>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>> >>>>>> To: amd-gfx at lists.freedesktop.org
>> >>>>>> Subject: Documentation about AMD's HSA implementation?
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm interested in HSA and was excited when I found AMD's fully open ROCm stack supporting it. Before digging into the code, I wonder if there is any documentation available about AMD's HSA implementation, whether a book, whitepaper, paper, or other documentation.
>> >>>>>>
>> >>>>>> I did find helpful materials about HSA, including the HSA standards on this page (http://www.hsafoundation.com/standards/) and a nice book about HSA (Heterogeneous System Architecture: A New Compute Platform Infrastructure). But regarding documentation about AMD's implementation, I haven't found anything yet.
>> >>>>>>
>> >>>>>> Please let me know if there are any that are publicly accessible. If not, I would appreciate any suggestions on learning the implementation of specific system components, e.g., queue scheduling.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Mark
>> >>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx