Bug ID | 103304 |
---|---|
Summary | multi-threaded usage of Gallium RadeonSI leads to NULL pointer exception in pb_cache_reclaim_buffer |
Product | Mesa |
Version | 17.0 |
Hardware | x86-64 (AMD64) |
OS | Linux (All) |
Status | NEW |
Severity | normal |
Priority | medium |
Component | Drivers/Gallium/radeonsi |
Assignee | dri-devel@lists.freedesktop.org |
Reporter | lper.home@gmail.com |
QA Contact | dri-devel@lists.freedesktop.org |

The issue is not present in Mesa 11.X. It is, however, present in Mesa 13.0.X and 17.0.X and, as far as I can see from the code, probably also in the latest Mesa 17.2.X.

Our code is very similar to the second example in https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading : we have two shared contexts. Rendering is done in one context/thread and texture uploading in the other. In this setup we hit a race that causes a crash (on average it takes about an hour to reproduce).

The crash has the following backtrace:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 pb_cache_reclaim_buffer (mgr=mgr@entry=0x1e8dd30, size=size@entry=2088960, alignment=alignment@entry=4096, usage=usage@entry=20, bucket_index=bucket_index@entry=3) at pipebuffer/pb_cache.c:183
    #1 0x00007fe2671c50e7 in amdgpu_bo_create (rws=0x1e8dbf0, size=<optimized out>, alignment=4096, domain=RADEON_DOMAIN_VRAM_GTT, flags=RADEON_FLAG_GTT_WC) at amdgpu_bo.c:754
    #2 0x00007fe2671db666 in r600_alloc_resource (rscreen=rscreen@entry=0x1e8f0c0, res=res@entry=0x7fe24c2d3100) at r600_buffer_common.c:197
    #3 0x00007fe2671e6eff in r600_texture_invalidate_storage (rctx=rctx@entry=0x1f9e900, rtex=rtex@entry=0x7fe24c2d3100) at r600_texture.c:1414
    #4 0x00007fe2671eb474 in r600_texture_transfer_map (ctx=0x1f9e900, texture=0x7fe24c2d3100, level=0, usage=258, box=0x7fe265bca970, ptransfer=0x7fe265bca898) at r600_texture.c:1483
    #5 0x00007fe267041807 in u_transfer_map_vtbl (context=<optimized out>, resource=<optimized out>, level=<optimized out>, usage=<optimized out>, box=<optimized out>, transfer=<optimized out>) at util/u_transfer.c:138
    #6 0x00007fe267041732 in u_default_texture_subdata (pipe=0x1f9e900, resource=0x7fe24c2d3100, level=<optimized out>, usage=<optimized out>, box=0x7fe265bca970, data="" stride=1920, layer_stride=2088960) at util/u_transfer.c:59
    #7 0x00007fe266e51137 in st_TexSubImage (ctx=<optimized out>, dims=2, texImage=<optimized out>, xoffset=0, yoffset=0, zoffset=0, width=1920, height=1088, depth=1, format=6403, type=5121, pixels=0x7fe218ac05e0, unpack=0x2000fc0) at state_tracker/st_cb_texture.c:1412
    #8 0x00007fe266dd75bf in _mesa_texture_sub_image (ctx=ctx@entry=0x1fe5d50, dims=dims@entry=2, texObj=texObj@entry=0x7fe24c2d2ca0, texImage=0x7fe24c2cda20, target=target@entry=3553, level=level@entry=0, xoffset=xoffset@entry=0, yoffset=yoffset@entry=0, zoffset=zoffset@entry=0, width=width@entry=1920, height=height@entry=1088, depth=depth@entry=1, format=format@entry=6403, type=type@entry=5121, pixels=pixels@entry=0x7fe218ac05e0, dsa=dsa@entry=false) at main/teximage.c:3239
    #9 0x00007fe266dd7787 in texsubimage (ctx=0x1fe5d50, dims=dims@entry=2, target=3553, level=0, xoffset=0, yoffset=0, zoffset=zoffset@entry=0, width=1920, height=1088, depth=depth@entry=1, format=format@entry=6403, type=type@entry=5121, pixels=pixels@entry=0x7fe218ac05e0, callerName=callerName@entry=0x7fe26723c036 "glTexSubImage2D") at main/teximage.c:3297
    #10 0x00007fe266dd7b49 in _mesa_TexSubImage2D (target=<optimized out>, level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>, width=<optimized out>, height=<optimized out>, format=6403, type=5121, pixels=0x7fe218ac05e0) at main/teximage.c:3438

If we enable assert() handling in the mesa3d library, this crash does not occur, because an assert is triggered first:

    #0 0x00007fd388fed124 in raise () from /lib64/libc.so.6
    #1 0x00007fd388fee58a in abort () from /lib64/libc.so.6
    #2 0x00007fd388fe5e47 in ?? () from /lib64/libc.so.6
    #3 0x00007fd388fe5ef2 in __assert_fail () from /lib64/libc.so.6
    #4 0x00007fd373986091 in pipe_reference_described (get_desc=<optimized out>, reference=0x7fd35801b100, ptr=0x0) at gallium/auxiliary/util/u_inlines.h:82
    #5 pipe_reference (reference=0x7fd35801b100, ptr=0x0) at gallium/auxiliary/util/u_inlines.h:102
    #6 pb_reference (src="" dst=0x2a260d0) at gallium/auxiliary/pipebuffer/pb_buffer.h:241
    #7 amdgpu_winsys_bo_reference (src="" dst=0x2a260d0) at amdgpu_bo.h:116
    #8 amdgpu_lookup_or_add_real_buffer (acs=0x3fea9d0, bo=0x7fd35801b100) at amdgpu_cs.c:358
    #9 0x00007fd3739863ac in amdgpu_cs_add_buffer (rcs=<optimized out>, buf=<optimized out>, usage=10, domains=<optimized out>, priority=RADEON_PRIO_SAMPLER_TEXTURE) at amdgpu_cs.c:450
    #10 0x00007fd3738d79fd in radeon_add_to_buffer_list (priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ, rbo=0x7fd358019cd0, ring=0x1eedeb8, rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:77
    #11 radeon_add_to_buffer_list_check_mem (check_mem=false, priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ, rbo=0x7fd358019cd0, ring=0x1eedeb8, rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:114
    #12 si_sampler_view_add_buffer (sctx=sctx@entry=0x1eedb60, resource=0x7fd358019cd0, usage=usage@entry=RADEON_USAGE_READ, is_stencil_sampler=<optimized out>, check_mem=check_mem@entry=false) at si_descriptors.c:316
    #13 0x00007fd3738d7cb2 in si_sampler_views_begin_new_cs (sctx=sctx@entry=0x1eedb60, views=views@entry=0x1eef360) at si_descriptors.c:350
    #14 0x00007fd3738dfd5a in si_all_descriptors_begin_new_cs (sctx=sctx@entry=0x1eedb60) at si_descriptors.c:2019
    #15 0x00007fd3738e0983 in si_begin_new_cs (ctx=ctx@entry=0x1eedb60) at si_hw_context.c:227
    #16 0x00007fd3738e14d3 in si_context_gfx_flush (context=0x1eedb60, flags=0, fence=0x0) at si_hw_context.c:162
    #17 0x00007fd37399c2a7 in r600_flush_from_st (ctx=0x1eedb60, fence=0x0, flags=<optimized out>) at r600_pipe_common.c:381
    #18 0x00007fd3735587ff in st_flush (st=st@entry=0x3e33870, fence=fence@entry=0x0, flags=flags@entry=0) at state_tracker/st_cb_flush.c:87
    #19 0x00007fd37355881e in st_glFlush (ctx=<optimized out>) at state_tracker/st_cb_flush.c:121
    #20 0x00007fd3733f7d71 in _mesa_flush (ctx=0x42cb4d0) at main/context.c:1838
    #21 0x00007fd3733f8436 in _mesa_Flush () at main/context.c:1870

What happens is a race between the texture upload thread calling r600_texture_invalidate_storage() and the glFlush call in the rendering thread, which ends up in radeon_add_to_buffer_list().

In radeon_add_to_buffer_list the following code is executed:

    return rctx->ws->cs_add_buffer(
        ring->cs, rbo->buf,
        (enum radeon_bo_usage)(usage | RADEON_USAGE_SYNCHRONIZED),
        rbo->domains, priority) * 4;

Meanwhile, in r600_alloc_resource the following code is executed:

    /* Replace the pointer such that if res->buf wasn't NULL, it won't be
     * NULL. This should prevent crashes with multiple contexts using
     * the same buffer where one of the contexts invalidates it while
     * the others are using it. */
    old_buf = res->buf;
    res->buf = new_buf; /* should be atomic */

The rbo variable in radeon_add_to_buffer_list and the res variable in r600_alloc_resource refer to the same object. While cs_add_buffer is still processing the buffer, that buffer is no longer linked to the rbo, because the other thread has swapped it out. r600_alloc_resource then drops the reference on the old buffer so that its count reaches zero, which triggers the assert in the other thread (the assert checks the reference count). When asserts are disabled, the buf object is cleaned up instead, which sets its prev/next pointers to NULL and causes the crash in pb_cache_reclaim_buffer while it walks its bucket/cache list of buffers.

We performed two tests:
- Letting the render thread perform the texture upload (done by a dirty hack in our code): the stability issue is gone.
- Making r600_can_invalidate_texture() always return false, so that the reallocation is never done: the stability issue is gone.

These two tests show that the race condition comes from the combination of multi-threading and texture invalidation during texture upload.

I suspect the problem lies in this check in r600_texture_transfer_map():

    if (r600_can_invalidate_texture(rctx->screen, rtex, usage, box))
        r600_texture_invalidate_storage(rctx, rtex);
    else
        use_staging_texture = true;

Here r600_can_invalidate_texture() returns true when it should not, because the texture is used shortly afterwards in another thread by the glFlush call.
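To make the suspected interleaving easier to follow, here is a minimal single-threaded replay of it in C. It is only an illustrative sketch: the toy_* types and functions are hypothetical stand-ins and are not the Mesa structures or code paths, and the second function shows one possible ordering that avoids the problem, not necessarily how Mesa should or does fix it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a reference-counted buffer and the resource that
 * points at it. Hypothetical types, not pb_buffer / r600_resource. */
struct toy_buffer {
    int refcount;
    bool destroyed;
};

struct toy_resource {
    struct toy_buffer *buf;
};

/* Drop one reference; the buffer is "destroyed" when the count hits zero.
 * Real code would free it or hand it back to a cache at this point. */
static void toy_unref(struct toy_buffer *b)
{
    assert(b->refcount > 0);
    if (--b->refcount == 0)
        b->destroyed = true;
}

/* Take a reference. Returns false in the situation where a debug build
 * would assert: the buffer's count has already dropped to zero. */
static bool toy_ref(struct toy_buffer *b)
{
    if (b->refcount == 0)
        return false;
    b->refcount++;
    return true;
}

/* Upload thread: swap in new storage, then drop the resource's
 * reference on the old storage. */
static void toy_invalidate_storage(struct toy_resource *res,
                                   struct toy_buffer *new_buf)
{
    struct toy_buffer *old_buf = res->buf;
    res->buf = new_buf;
    toy_unref(old_buf);
}

/* Replay of the suspected race: the render thread reads res->buf, the
 * upload thread invalidates the storage, and only then does the render
 * thread try to reference the pointer it read earlier. */
static bool replay_racy_interleaving(void)
{
    struct toy_buffer a = { 1, false };
    struct toy_buffer b = { 1, false };
    struct toy_resource res = { &a };

    struct toy_buffer *seen = res.buf;  /* render: pointer read, no ref yet */
    toy_invalidate_storage(&res, &b);   /* upload: a's count drops to 0 */
    return toy_ref(seen);               /* render: too late, returns false */
}

/* Same steps, but the render thread's read and reference complete before
 * the swap, as they would if both sides were serialized (for example by
 * a lock; that choice is an assumption of this sketch). */
static bool replay_ref_before_swap(void)
{
    struct toy_buffer a = { 1, false };
    struct toy_buffer b = { 1, false };
    struct toy_resource res = { &a };

    struct toy_buffer *seen = res.buf;
    bool ok = toy_ref(seen);            /* a: 1 -> 2 */
    toy_invalidate_storage(&res, &b);   /* a: 2 -> 1, still alive */
    ok = ok && !a.destroyed;
    toy_unref(seen);                    /* a: 1 -> 0, destroyed only now */
    return ok && a.destroyed;
}
```

In this model replay_racy_interleaving() reports the failed reference that corresponds to the assert in pipe_reference_described, while replay_ref_before_swap() keeps the old buffer alive until the render side has finished with it.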