Thanks, John!

On Sat, Mar 17, 2018 at 4:17 PM, Bridgman, John <John.Bridgman at amd.com> wrote:
>
>>-----Original Message-----
>>From: Ming Yang [mailto:minos.future at gmail.com]
>>Sent: Saturday, March 17, 2018 12:35 PM
>>To: Kuehling, Felix; Bridgman, John
>>Cc: amd-gfx at lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>Hi,
>>
>>After digging into the documents and code, our previous discussion about
>>GPU workload scheduling (mainly HWS and ACE scheduling) makes a lot more
>>sense to me now. Thanks a lot! I'm writing this email to ask more
>>questions. Before asking, let me first share a few links to the documents
>>that were most helpful to me.
>>
>>GCN (1st gen.?) architecture whitepaper
>>https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
>>Notes: ACE scheduling.
>>
>>Polaris architecture whitepaper (4th gen. GCN)
>>http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
>>Notes: ACE scheduling; HWS; quick response queue (priority assignment);
>>compute unit reservation.
>>
>>AMDKFD patch cover letters:
>>v5: https://lwn.net/Articles/619581/
>>v1: https://lwn.net/Articles/605153/
>>
>>A comprehensive performance analysis of HSA and OpenCL 2.0:
>>http://ieeexplore.ieee.org/document/7482093/
>>
>>Partitioning resources of a processor (AMD patent)
>>https://patents.google.com/patent/US8933942B2/
>>Notes: compute resources are allocated according to the resource
>>requirement percentage of the command.
>>
>>Here come my questions about ACE scheduling. Most of my questions concern
>>ACE scheduling because the firmware is closed-source and how an ACE
>>schedules commands (queues) is not detailed enough in these documents.
>>I'm not able to run experiments on Raven Ridge yet.
>>
>>1. Can the wavefronts of one command scheduled by an ACE be spread out
>>across multiple compute engines (shader arrays)? This seems to be
>>confirmed by the cu_mask setting, as the cu_mask for one queue can cover
>>CUs on multiple compute engines.
>
> Correct, assuming the work associated with the command is not trivially
> small and so generates enough wavefronts to require multiple CUs.
>
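By the way, to pin a queue to a subset of CUs in my experiments, I plan to
set the CU mask from user space. Below is a minimal sketch of what I have in
mind, using the hsa_amd_queue_cu_set_mask() extension from ROCm's
hsa_ext_amd.h on a queue previously created with hsa_queue_create(). It's
untested on my side, and error handling is omitted:

    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>

    /* Restrict 'queue' to the first 32 CUs. The mask has one bit per CU,
     * and those bits can span CUs in different shader engines, which is
     * why one queue's wavefronts can spread across compute engines. */
    static hsa_status_t pin_queue_to_first_32_cus(hsa_queue_t *queue)
    {
        uint32_t cu_mask[2] = { 0xffffffffu, 0x00000000u };
        /* The second argument is the mask length in bits (multiple of 32). */
        return hsa_amd_queue_cu_set_mask(queue, 64, cu_mask);
    }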
>>
>>2. If so, how is the competition between commands scheduled by different
>>ACEs resolved? What's the scheduling scheme? For example, when each ACE
>>has a command ready to occupy 50% of the compute resources, do these 4
>>commands each occupy 25%, do they execute round-robin with 50% of the
>>resources at a time, or do just the first two scheduled commands execute
>>while the later two wait?
>
> Depends on how you measure compute resources, since each SIMD in a CU can
> have up to 10 separate wavefronts running on it as long as total register
> usage for all the threads does not exceed the number available in HW.
>
> If each ACE (let's say pipe for clarity) has enough work to put a single
> wavefront on 50% of the SIMDs, then all of the work would get scheduled to
> the SIMDs (4 SIMDs per CU) and run in a round-robin-ish manner as each
> wavefront was blocked waiting for memory access.
>
> If each pipe has enough work to fill 50% of the CUs and all pipes/queues
> were assigned the same priority (see below), then the behaviour would be
> more like "each one would get 25%, and each time a wavefront finished
> another one would be started".

This makes sense to me. I will try some experiments once Raven Ridge is
ready.

>>
>>3. If the barrier bit of the AQL packet is not set, does the ACE schedule
>>the following command using the same scheduling scheme as in #2?
>
> Not sure, barrier behaviour has paged so far out of my head that I'll have
> to skip this one.

This barrier bit is defined in HSA. If it is set, the following packet must
wait until the current packet finishes. It's probably the key to implementing
out-of-order execution in OpenCL, but I'm not sure. I should be able to use
the profiler to find the answer once I can run OpenCL on Raven Ridge.

>>
>>4. An ACE takes 3 pipe priorities (low, medium, and high), even though an
>>AQL queue has 7 priority levels, right?
>
> Yes-ish. Remember that there are multiple levels of scheduling going on
> here. At any given time a pipe is only processing work from one of the
> queues; queue priorities affect the pipe's round-robin-ing between queues
> in a way that I have managed to forget (but will try to find). There is a
> separate pipe priority, which IIRC is actually programmed per queue and
> takes effect when the pipe is active on that queue. There is also a global
> (IIRC) setting which adjusts how compute work and graphics work are
> prioritized against each other, giving options like making all compute
> lower priority than graphics or making only high-priority compute get
> ahead of graphics.
>
> I believe the pipe priority is also referred to as SPI priority, since it
> affects the way the SPI decides which pipe (graphics/compute) to accept
> work from next.
>
> This is all a bit complicated by a separate (global, IIRC) option which
> randomizes priority settings in order to avoid deadlock in certain
> conditions. We used to have that enabled by default (I believe it was
> needed for specific OpenCL programs), but I'm not sure if it is still
> enabled - if so, then most of the above gets murky because of the
> randomization.
>
> At first glance we do not enable randomization for Polaris or Vega, but we
> do for all of the older parts. Haven't looked at Raven yet.

Thanks for providing these details!

>>
>>5. Is this patent (https://patents.google.com/patent/US8933942B2/)
>>implemented? How do I set the resource allocation percentage for
>>commands/queues?
>
> I don't remember seeing that being implemented in the drivers.
>
>>
>>If these features work well, I have confidence in AMD GPUs providing very
>>nice real-time predictability.
>>
>>
>>Thanks,
>>Ming

Thanks,
Ming
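P.S. For the barrier-bit experiment in #3, here is roughly how I plan to
build the dispatch packet headers, using the standard enums from hsa.h (just
a sketch on my side, untested so far):

    #include <hsa/hsa.h>

    /* Build an AQL kernel-dispatch packet header. With 'barrier' nonzero,
     * the packet processor waits for all preceding packets in the queue to
     * complete before launching this one; with it zero, this packet may be
     * processed concurrently with earlier packets. */
    static uint16_t dispatch_header(int barrier)
    {
        uint16_t h = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
        h |= (barrier ? 1 : 0) << HSA_PACKET_HEADER_BARRIER;
        h |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
        h |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
        return h;
    }

Comparing profiler timelines for the same two dispatches with and without
the barrier bit should show whether the unset-bit case falls back to the
scheme discussed in #2.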