On Wed, 2017-11-29 at 16:58 -0500, Felix Kuehling wrote:
> You can see the state of the queues in debugfs:
> /sys/kernel/debug/kfd/... You can look at MQDs and HQDs.

Thanks. How do I decode the information?
The rptr always stops at position 60, which looks like this in mqds:

DIQ on device 45a2
00000000: c0310800 00004000 00000000 00000000 00000000 00000000 00000000 00000000
00000020: 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000
00000040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ffffffff
00000060: ffffffff 00000000 ffffffff ffffffff 00000000 00000000 00000000 00000000

If I understood correctly, that's the queue dump, so those ffffffff
values look wrong.

>
> If your application isn't stopping queues deliberately, queues get
> disabled by evictions, usually temporarily. You'll see kernel messages
> when that happens.
>
> A VM fault will result in queues of the offending process getting
> disabled permanently. Again, you'll see messages about that in the
> kernel log.
>
> The RPTR can also stop advancing if you have an infinite loop in a
> shader program, or just a shader that takes a very long time to execute.
> Or maybe if you have some dependencies (barriers) in your AQL packets
> that never get satisfied.
>
> The function you changed only affects the HIQ, the queue that KFD uses
> to control the HWS. It does not affect user mode queues. If your problem
> is with a user mode queue, your change should have no effect at all.

It's not a userspace queue that stops. I'm using the kernel dbgdev to
issue wave_resume commands (waves are halted after executing
s_sendmsg_halt). I bumped KFD_KERNEL_QUEUE_SIZE to 16KB to make sure
all 320 resume commands fit (otherwise I get spurious ENOMEM when the
queue is full but still advancing).

thanks,
Jan

>
> Regards,
>   Felix
>
>
> On 2017-11-29 04:43 PM, Jan Vesely wrote:
> > On Mon, 2017-11-20 at 14:22 -0500, Felix Kuehling wrote:
> > > I think this patch is not correct. The EOP-mem is not associated with
> > > the queue size. The EOP buffer is a separate buffer used by the firmware
> > > to handle command completion. As I understand it, this allows more
> > > concurrency, while still making it look like all commands in the queue
> > > are completing in order.
> >
> > thanks for the explanation. I was looking for a source of a CP hang
> > (rptr stops advancing), but bumping the eop size actually made things
> > worse. Is there a way to find out if a queue got disabled and for what
> > reason? (I'm running ROCK-1.6.x based kernel)
> >
> > thanks,
> > Jan
> >
> > > Regards,
> > >   Felix
> > >
> > >
> > > On 2017-11-19 03:19 AM, Oded Gabbay wrote:
> > > > On Thu, Nov 16, 2017 at 11:36 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
> > > > > Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > index f1d48281e322..b3bee39661ab 100644
> > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
> > > > > @@ -37,15 +37,16 @@ static bool initialize_vi(struct kernel_queue *kq, struct kfd_dev *dev,
> > > > >                         enum kfd_queue_type type, unsigned int queue_size)
> > > > >  {
> > > > >         int retval;
> > > > > +       unsigned int size = ALIGN(queue_size, PAGE_SIZE);
> > > > >
> > > > > -       retval = kfd_gtt_sa_allocate(dev, PAGE_SIZE, &kq->eop_mem);
> > > > > +       retval = kfd_gtt_sa_allocate(dev, size, &kq->eop_mem);
> > > > >         if (retval != 0)
> > > > >                 return false;
> > > > >
> > > > >         kq->eop_gpu_addr = kq->eop_mem->gpu_addr;
> > > > >         kq->eop_kernel_addr = kq->eop_mem->cpu_ptr;
> > > > >
> > > > > -       memset(kq->eop_kernel_addr, 0, PAGE_SIZE);
> > > > > +       memset(kq->eop_kernel_addr, 0, size);
> > > > >
> > > > >         return true;
> > > > >  }
> > > > > --
> > > > > 2.13.6
> > > > >
> > > > > _______________________________________________
> > > > > amd-gfx mailing list
> > > > > amd-gfx at lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > > >
> > > > Thanks!
> > > > Applied to -next tree
> > > > Oded
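
As a reading aid for the MQD dump quoted at the top of this message, here is a minimal Python sketch that splits the debugfs hexdump into 32-bit dwords and lists the non-zero ones by dword index and byte offset. It assumes only that each line has the form `OFFSET: word word ...` with a hexadecimal byte offset, as in the dump above; the helper names `parse_dump` and `nonzero_dwords` are hypothetical and not part of any kernel or ROCm tool. Mapping an index to a field name is a further assumption: if the dump follows the `struct vi_mqd` layout in the kernel sources (not confirmed in this thread), dwords 23, 24, 26 and 27 may correspond to the `compute_static_thread_mgmt_se0..3` CU masks, which would make the ffffffff values expected defaults rather than corruption. That is worth checking against the kernel tree in use.

```python
# Sketch: index the dwords of a KFD debugfs MQD hexdump.
# Assumption: every line looks like "OFFSET: w0 w1 ... w7" (hex byte offset,
# then 32-bit hex words), as in the dump quoted in this thread.

DUMP = """\
00000000: c0310800 00004000 00000000 00000000 00000000 00000000 00000000 00000000
00000020: 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000
00000040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ffffffff
00000060: ffffffff 00000000 ffffffff ffffffff 00000000 00000000 00000000 00000000
"""


def parse_dump(text):
    """Return {dword_index: value} for every word in the hexdump."""
    dwords = {}
    for line in text.strip().splitlines():
        offset_str, _, words = line.partition(":")
        base = int(offset_str, 16)  # byte offset of the first word on the line
        for i, word in enumerate(words.split()):
            dwords[base // 4 + i] = int(word, 16)
    return dwords


def nonzero_dwords(dwords):
    """Return (index, value) pairs for non-zero dwords, sorted by index."""
    return sorted((i, v) for i, v in dwords.items() if v != 0)


if __name__ == "__main__":
    for idx, val in nonzero_dwords(parse_dump(DUMP)):
        print(f"dword {idx:3d} (byte offset 0x{idx * 4:02x}): 0x{val:08x}")
```

With the dword indices in hand, each non-zero entry can be compared against the MQD structure definition for the ASIC in question instead of being counted by eye in the raw dump.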