One more thing to add: if I allocate the ringbuffer from normal memory instead of stolen memory, the issue is gone.

static int intel_alloc_ringbuffer_obj(struct drm_device *dev,
				      struct intel_ringbuffer *ringbuf)
{
	struct drm_i915_gem_object *obj;

	obj = NULL;
#if 0
	if (!HAS_LLC(dev))
		obj = i915_gem_object_create_stolen(dev, ringbuf->size);
#endif
	if (obj == NULL)
		obj = i915_gem_alloc_object(dev, ringbuf->size);
	if (obj == NULL)
		return -ENOMEM;

	/* mark ring buffers as read-only from GPU side by default */
	obj->gt_ro = 1;

	ringbuf->obj = obj;

	return 0;
}

Can anyone give me some directions to check? Thanks!

-James

-----Original Message-----
From: Tang, Jun
Sent: Friday, July 1, 2016 1:14 PM
To: 'intel-gfx@xxxxxxxxxxxxxxxxxxxxx' <intel-gfx@xxxxxxxxxxxxxxxxxxxxx>
Subject: GPU hang with high media workload on BSW

Hi Guys,

Thanks for the help in advance!

I'm encountering a GPU hang while running multi-channel H264 video decoding + VPP composition, display, and one channel of H264 encoding on BSW. The render ring gets stuck, as shown below:

[58503.223700] [drm] stuck on render ring
[58503.246340] [drm] GPU HANG: ecode 8:0:0x7f1d7e3d, in Challenge [3259], reason: Ring hung, action: reset

Below is part of /sys/class/drm/card0/error. I suspect the hang is caused by incorrect render ring buffer content: on the lines marked 'where I suspect', the ring buffer value is 18800001 (MI_BATCH_BUFFER_START_GEN8), but the next DWORD is 00100002. Since MI_BATCH_BUFFER_START_GEN8 should be followed by the batch buffer address, I think the content of the ring buffer is not correct.
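As a rough illustration of the decoding behind that suspicion, the sketch below splits a command header DWORD into its command type and MI opcode. The helper names (cmd_type, mi_opcode, is_mi_batch_buffer_start) are hypothetical and not taken from the i915 driver. The bit layout used here (command type in bits 31:29, MI opcode in bits 28:23, opcode 0x31 for MI_BATCH_BUFFER_START) is an assumption based on the usual MI command header format and should be checked against the hardware documentation.

```c
#include <stdint.h>

/* Bits 31:29 of a command header select the command type
 * (assumed: 0 = MI command, 3 = GFXPIPE command such as PIPE_CONTROL). */
static unsigned int cmd_type(uint32_t header)
{
	return header >> 29;
}

/* For MI commands, bits 28:23 are assumed to hold the opcode. */
static unsigned int mi_opcode(uint32_t header)
{
	return (header >> 23) & 0x3f;
}

/* Hypothetical helper: does this DWORD look like MI_BATCH_BUFFER_START?
 * Opcode 0x31 is assumed from MI_INSTR(0x31, ...) style definitions. */
static int is_mi_batch_buffer_start(uint32_t header)
{
	return cmd_type(header) == 0 && mi_opcode(header) == 0x31;
}
```

Under these assumptions, is_mi_batch_buffer_start(0x18800001) is true, while 0x7a000004 (the value intel_ring_emit was asked to write) decodes as command type 3, i.e. not an MI command.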
==========part of the /sys/class/drm/card0/error=========
render ring --- 3 requests
  seqno 0x020dc83a, emitted 4353167966, tail 0x00000070
  seqno 0x020dc83b, emitted 4353167969, tail 0x000000f0
  seqno 0x020dc83e, emitted 4353167982, tail 0x00000170
render ring --- ringbuffer = 0x00015000
00000000 :  18800001    // where I suspect
00000004 :  00100002    // where I suspect
00000008 :  00000000
0000000c :  00000000
00000010 :  00000000
00000014 :  00000000
00000018 :  7a000004
0000001c :  01144c1c
00000020 :  00036080
00000024 :  00000000
00000028 :  00000000
0000002c :  00000000
00000030 :  04000000
00000034 :  00000000
00000038 :  0c000000
0000003c :  1382c10c
==========part of the /sys/class/drm/card0/error=========

To identify when the ring buffer is incorrectly programmed, I added code to gen8_emit_pipe_control that reads back the first DWORD of the ring buffer after intel_ring_emit when the tail of the ring buffer is zero. The result: when tail is 0, the first DWORD read back is sometimes different from the data intel_ring_emit has just written, and a GPU hang may happen right after.

Here is the output of my print:

[ 3409.067402] rcs b:0x18800001 d:0x7a000004 t:0

'b' - ioread32(ringbuf->virtual_start)
'd' - the value intel_ring_emit wants to write
't' - the value of tail

I'm aware that ringbuf->virtual_start is mapped write-combining, so the read may lead to a write-combine buffer flush and slow read performance. But I don't know why the value read back differs from the value intel_ring_emit has just written.

In another test, when the value read back was not correct, I wrote it again and then read it back again; most of the time it then became correct.

Thanks a lot!
-James
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx