[PATCH 1/2] drm/amdgpu: use multipipe compute policy on non PL11 asics

felix.kuehling@xxxxxxx (Felix Kuehling) · Tue, 7 Nov 2017 17:27:44 -0500

Not sure amd-gfx is the right list for this. The code I'm talking about
is not upstream yet, and the problem being discussed occurs on the DKMS
branch.

Even though we say we don't support Tonga (there are known issues with
ROCm on Tonga), KFD will detect and initialize it. That means the HWS is
initialized. If user mode initializes the ROCr Runtime (even if it's
just for device enumeration), I believe at least one user mode queue is
created, which will just sit idle. That idle user mode queue is probably
on pipe 0. The pipe will periodically switch between the idle user mode
queue and a kernel compute queue.

You can see a hexdump of MQDs of all KFD user mode queues in
/sys/kernel/debug/kfd/mqds. You can also inspect the HQDs in
/sys/kernel/debug/kfd/hqds.

Regards,
  Felix

On 2017-11-07 04:23 AM, Zhou, David(ChunMing) wrote:
>
> I got the information about this issue:
>
> "
>
> # If I install #491963 (failed in the report) with './amdgpu-pro-install
> -y --opencl=legacy' command, the test passed. It failed when rocm is
> also installed with './amdgpu-pro-install -y --opencl=legacy,rocm'
> command."
>
> So I guess the hang is related to the pipe being used by both ORCA and
> ROCm. But Felix said they don't support ROCm on Tonga, which could mean
> this issue doesn't matter currently.
>
>
> Regards,
>
> David Zhou
>
> ------------------------------------------------------------------------
> *From:* Andres Rodriguez <andresx7 at gmail.com>
> *Sent:* Tuesday, November 7, 2017 3:26:38 PM
> *To:* Zhou, David(ChunMing)
> *Cc:* amd-gfx list; Deucher, Alexander
> *Subject:* Re: [PATCH 1/2] drm/amdgpu: use multipipe compute policy on
> non PL11 asics
>
> Do you have any work actually going into multiple pipes? My
> understanding is that opencl will only use one queue at a time (but
> I'm not really certain about that).
>
> What you can also check is if the app works correctly when it is executed
> on pipe0, and if it hangs on pipe 1+. I removed all the locations
> where pipe0 was hardcoded in the open driver, but it is possible it is
> still hardcoded somewhere on the closed stack.
>
> Regards,
> Andres
>
> On Nov 6, 2017 10:19 PM, "Zhou, David(ChunMing)" <David1.Zhou at amd.com
> <mailto:David1.Zhou at amd.com>> wrote:
>
>     Then synchronization should not be the problem; it may be related to
>     a multipipe hw setting issue.
>
>
>     Regards,
>
>     David Zhou
>
>     ------------------------------------------------------------------------
>     *From:* Andres Rodriguez <andresx7 at gmail.com
>     <mailto:andresx7 at gmail.com>>
>     *Sent:* Tuesday, November 7, 2017 2:00:57 AM
>     *To:* Zhou, David(ChunMing); amd-gfx list
>     *Cc:* Deucher, Alexander
>     *Subject:* Re: [PATCH 1/2] drm/amdgpu: use multipipe compute
>     policy on non PL11 asics
>
>     Sorry my mail client seems to have blown up. My reply got cut off,
>     here is the full version:
>
>
>
>     On 2017-11-06 01:49 AM, Chunming Zhou wrote:
>     > Hi Andres,
>     >
>     Hi David,
>
>     > With this patch of yours, OCLperf hung.
>     Is this on all ASICs or just a specific one?
>
>     >
>     > Could you explain more?
>     >
>     > If I understand correctly, the difference between with and without
>     > this patch is setting the first two queues versus setting all queues
>     > of pipe0 in queue_bitmap.
>     It is slightly different. With this patch we will also use the first
>     two queues of all pipes, not just pipe 0:
>
>     Pre-patch:
>
>     |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>      11111111  00000000  00000000  00000000
>
>     Post-patch:
>
>     |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>      11000000  11000000  11000000  11000000
>
>     What this means is that we are allowing real multithreading for
>     compute. Jobs on different pipes allow for parallel execution of work.
>     Jobs on the same pipe (but different queues) use timeslicing to share
>     the hardware.
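
As a rough user-space illustration of the two layouts above (not the
driver code): the sketch below assumes a single MEC with 4 pipes of 8
queues each, as in the diagrams. The names acquire_queues, NUM_PIPES and
QUEUES_PER_PIPE are hypothetical. Bit i of the returned mask stands for
queue (i % 8) on pipe (i / 8), so bit 0 corresponds to the leftmost digit
of the diagram.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIPES       4   /* assumption: pipes per MEC, as in the diagram */
#define QUEUES_PER_PIPE 8   /* assumption: queues per pipe, as in the diagram */

/*
 * Build the ownership mask for one MEC.
 * multipipe_policy == true : first two queues of every pipe (post-patch)
 * multipipe_policy == false: all queues of pipe 0 only (pre-patch)
 */
static uint32_t acquire_queues(bool multipipe_policy)
{
    uint32_t bitmap = 0;

    for (int i = 0; i < NUM_PIPES * QUEUES_PER_PIPE; ++i) {
        int queue = i % QUEUES_PER_PIPE;
        int pipe = i / QUEUES_PER_PIPE;

        if (multipipe_policy ? (queue < 2) : (pipe == 0))
            bitmap |= (uint32_t)1 << i;
    }
    return bitmap;
}
```

Under these assumptions acquire_queues(false) gives 0x000000FF (eight
queues, all on pipe 0) and acquire_queues(true) gives 0x03030303 (two
queues on each of the four pipes).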
>
>
>     >
>     > Then the UMD can use differently numbered queues to submit compute
>     > commands, as selected by amdgpu_queue_mgr_map.
>     >
>     > I checked amdgpu_queue_mgr_map implementation, CS_IOCTL can map user
>     > ring to different hw ring depending on busy or idle, right?
>     Yes, when a queue is first used, amdgpu_queue_mgr_map will decide what
>     the mapping is for a usermode ring to a kernel ring id.
>
>     > If yes, I see a bug in it, which will result in our sched_fence not
>     > working. Our sched fence assumes jobs will be executed in order;
>     > your queue mapping breaks this.
>
>     I think here you mean that work will execute out of order because it
>     will go to different rings?
>
>     That should not happen, since the id mapping is permanent on a
>     per-context basis. Once a mapping is decided, it will be cached for
>     this context so that we keep execution order guarantees. See the
>     id-caching code in amdgpu_queue_mgr.c for reference.
>
>     As long as the usermode keeps submitting work to the same ring, it
>     will all be executed in order (all in the same ring). There is no
>     change in this guarantee compared to pre-patch. Note that even before
>     this patch amdgpu_queue_mgr_map has been using an LRU policy for a
>     long time now.
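
To make the ordering argument above concrete, here is a minimal sketch
of "pick once via LRU, then cache per context". It is not the code in
amdgpu_queue_mgr.c; every name below (hw_ring_lru, ctx_ring_map,
lru_pick, ctx_map_ring) and the ring counts are hypothetical.

```c
#include <stddef.h>

#define NUM_HW_RINGS   4    /* hypothetical number of kernel compute rings */
#define NUM_USER_RINGS 4    /* hypothetical number of usermode ring ids */
#define UNMAPPED       (-1)

/* Device-wide state: when each hw ring was last handed out. */
struct hw_ring_lru {
    unsigned long last_used[NUM_HW_RINGS];
    unsigned long clock;
};

/* Per-context state: cached usermode ring -> hw ring mapping. */
struct ctx_ring_map {
    int cached[NUM_USER_RINGS];
};

static void ctx_ring_map_init(struct ctx_ring_map *map)
{
    for (size_t i = 0; i < NUM_USER_RINGS; ++i)
        map->cached[i] = UNMAPPED;
}

/* Pick the least recently used hw ring and mark it as just used. */
static int lru_pick(struct hw_ring_lru *lru)
{
    int best = 0;

    for (int i = 1; i < NUM_HW_RINGS; ++i)
        if (lru->last_used[i] < lru->last_used[best])
            best = i;
    lru->last_used[best] = ++lru->clock;
    return best;
}

/* Decide the mapping on first use, then always return the cached value. */
static int ctx_map_ring(struct ctx_ring_map *map, struct hw_ring_lru *lru,
                        int user_ring)
{
    if (user_ring < 0 || user_ring >= NUM_USER_RINGS)
        return UNMAPPED;
    if (map->cached[user_ring] == UNMAPPED)
        map->cached[user_ring] = lru_pick(lru);
    return map->cached[user_ring];
}
```

In this sketch the LRU choice only influences the first submission on a
given usermode ring. Every later submission from the same context lands
on the same hw ring, while a second context may be steered to a
different one.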
>
>     Regards,
>     Andres
>
>     On Mon, Nov 6, 2017 at 12:44 PM, Andres Rodriguez
>     <andresx7 at gmail.com <mailto:andresx7 at gmail.com>> wrote:
>     >
>     >
>     > On 2017-11-06 01:49 AM, Chunming Zhou wrote:
>     >>
>     >> Hi Andres,
>     >>
>     >
>     > Hi David,
>     >
>     >> With this patch of yours, OCLperf hung.
>     >
>     >
>     > Is this on all ASICs or just a specific one?
>     >
>     >>
>     >> Could you explain more?
>     >>
>     >> If I understand correctly, the difference between with and without
>     >> this patch is setting the first two queues versus setting all queues
>     >> of pipe0 in queue_bitmap.
>     >
>     >
>     > It is slightly different. With this patch we will also use the first
>     > two queues of all pipes, not just pipe 0:
>     >
>     > Pre-patch:
>     >
>     > |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>     >  11111111  00000000  00000000  00000000
>     >
>     > Post-patch:
>     >
>     > |-Pipe 0-||-Pipe 1-||-Pipe 2-||-Pipe 3-|
>     >  11000000  11000000  11000000  11000000
>     >
>     > What this means is that we are allowing real multithreading for
>     > compute. Jobs on different pipes allow for parallel execution of work.
>     > Jobs on the same pipe (but different queues) use timeslicing to share
>     > the hardware.
>     >
>     >
>     >
>     >>
>     >> Then the UMD can use differently numbered queues to submit compute
>     >> commands, as selected by amdgpu_queue_mgr_map.
>     >>
>     >> I checked amdgpu_queue_mgr_map implementation, CS_IOCTL can map user
>     >> ring to different hw ring depending on busy or idle, right?
>     >>
>     >> If yes, I see a bug in it, which will result in our sched_fence not
>     >> working. Our sched fence assumes jobs will be executed in order;
>     >> your queue mapping breaks this.
>     >>
>     >>
>     >> Regards,
>     >>
>     >> David Zhou
>     >>
>     >>
>     >> On 2017-09-27 00:22, Andres Rodriguez wrote:
>     >>>
>     >>> A performance regression for OpenCL tests on Polaris11 had this
>     >>> feature disabled for all asics.
>     >>>
>     >>> Instead, disable it selectively on the affected asics.
>     >>>
>     >>> Signed-off-by: Andres Rodriguez <andresx7 at gmail.com
>     <mailto:andresx7 at gmail.com>>
>     >>> ---
>     >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 14 ++++++++++++--
>     >>>   1 file changed, 12 insertions(+), 2 deletions(-)
>     >>>
>     >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>     >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>     >>> index 4f6c68f..3d76e76 100644
>     >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>     >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>     >>> @@ -109,9 +109,20 @@ void amdgpu_gfx_parse_disable_cu(unsigned
>     *mask,
>     >>> unsigned max_se, unsigned max_s
>     >>>       }
>     >>>   }
>     >>> +static bool amdgpu_gfx_is_multipipe_capable(struct amdgpu_device *adev)
>     >>> +{
>     >>> +    /* FIXME: spreading the queues across pipes causes perf regressions
>     >>> +     * on POLARIS11 compute workloads */
>     >>> +    if (adev->asic_type == CHIP_POLARIS11)
>     >>> +        return false;
>     >>> +
>     >>> +    return adev->gfx.mec.num_mec > 1;
>     >>> +}
>     >>> +
>     >>>   void amdgpu_gfx_compute_queue_acquire(struct amdgpu_device *adev)
>     >>>   {
>     >>>       int i, queue, pipe, mec;
>     >>> +    bool multipipe_policy = amdgpu_gfx_is_multipipe_capable(adev);
>     >>>       /* policy for amdgpu compute queue ownership */
>     >>>       for (i = 0; i < AMDGPU_MAX_COMPUTE_QUEUES; ++i) {
>     >>> @@ -125,8 +136,7 @@ void amdgpu_gfx_compute_queue_acquire(struct amdgpu_device *adev)
>     >>>           if (mec >= adev->gfx.mec.num_mec)
>     >>>               break;
>     >>> -        /* FIXME: spreading the queues across pipes causes perf regressions */
>     >>> -        if (0) {
>     >>> +        if (multipipe_policy) {
>     >>>               /* policy: amdgpu owns the first two queues of the first MEC */
>     >>>               if (mec == 0 && queue < 2)
>     >>>                   set_bit(i, adev->gfx.mec.queue_bitmap);
>     >>
>     >>
>     >
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx