[Bug 103304] multi-threaded usage of Gallium RadeonSI leads to NULL pointer exception in pb_cache_reclaim_buffer

Bug ID:      103304
Summary:     multi-threaded usage of Gallium RadeonSI leads to NULL pointer exception in pb_cache_reclaim_buffer
Product:     Mesa
Version:     17.0
Hardware:    x86-64 (AMD64)
OS:          Linux (All)
Status:      NEW
Severity:    normal
Priority:    medium
Component:   Drivers/Gallium/radeonsi
Assignee:    dri-devel@lists.freedesktop.org
Reporter:    lper.home@gmail.com
QA Contact:  dri-devel@lists.freedesktop.org

The issue is not present in Mesa 11.X. It is present in Mesa 13.0.X and 17.0.X,
and from reading the code it is probably also present in the latest Mesa 17.2.X.
Our code is very similar to the second example in
https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading : we have two
shared contexts. Rendering is done in one context/thread and texture uploading
is done in the other. In this setup we hit a race that causes a crash (on
average it takes about an hour to hit the issue).

The crash has the following backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  pb_cache_reclaim_buffer (mgr=mgr@entry=0x1e8dd30, size=size@entry=2088960,
alignment=alignment@entry=4096, usage=usage@entry=20,
    bucket_index=bucket_index@entry=3) at pipebuffer/pb_cache.c:183
#1  0x00007fe2671c50e7 in amdgpu_bo_create (rws=0x1e8dbf0, size=<optimized
out>, alignment=4096, domain=RADEON_DOMAIN_VRAM_GTT, flags=RADEON_FLAG_GTT_WC)
    at amdgpu_bo.c:754
#2  0x00007fe2671db666 in r600_alloc_resource (rscreen=rscreen@entry=0x1e8f0c0,
res=res@entry=0x7fe24c2d3100) at r600_buffer_common.c:197
#3  0x00007fe2671e6eff in r600_texture_invalidate_storage
(rctx=rctx@entry=0x1f9e900, rtex=rtex@entry=0x7fe24c2d3100) at
r600_texture.c:1414
#4  0x00007fe2671eb474 in r600_texture_transfer_map (ctx=0x1f9e900,
texture=0x7fe24c2d3100, level=0, usage=258, box=0x7fe265bca970,
    ptransfer=0x7fe265bca898) at r600_texture.c:1483
#5  0x00007fe267041807 in u_transfer_map_vtbl (context=<optimized out>,
resource=<optimized out>, level=<optimized out>, usage=<optimized out>,
    box=<optimized out>, transfer=<optimized out>) at util/u_transfer.c:138
#6  0x00007fe267041732 in u_default_texture_subdata (pipe=0x1f9e900,
resource=0x7fe24c2d3100, level=<optimized out>, usage=<optimized out>,
    box=0x7fe265bca970, data="" stride=1920, layer_stride=2088960)
at util/u_transfer.c:59
#7  0x00007fe266e51137 in st_TexSubImage (ctx=<optimized out>, dims=2,
texImage=<optimized out>, xoffset=0, yoffset=0, zoffset=0, width=1920,
    height=1088, depth=1, format=6403, type=5121, pixels=0x7fe218ac05e0,
unpack=0x2000fc0) at state_tracker/st_cb_texture.c:1412
#8  0x00007fe266dd75bf in _mesa_texture_sub_image (ctx=ctx@entry=0x1fe5d50,
dims=dims@entry=2, texObj=texObj@entry=0x7fe24c2d2ca0,
    texImage=0x7fe24c2cda20, target=target@entry=3553, level=level@entry=0,
xoffset=xoffset@entry=0, yoffset=yoffset@entry=0, zoffset=zoffset@entry=0,
    width=width@entry=1920, height=height@entry=1088, depth=depth@entry=1,
format=format@entry=6403, type=type@entry=5121,
    pixels=pixels@entry=0x7fe218ac05e0, dsa=dsa@entry=false) at
main/teximage.c:3239
#9  0x00007fe266dd7787 in texsubimage (ctx=0x1fe5d50, dims=dims@entry=2,
target=3553, level=0, xoffset=0, yoffset=0, zoffset=zoffset@entry=0,
    width=1920, height=1088, depth=depth@entry=1, format=format@entry=6403,
type=type@entry=5121, pixels=pixels@entry=0x7fe218ac05e0,
    callerName=callerName@entry=0x7fe26723c036 "glTexSubImage2D") at
main/teximage.c:3297
#10 0x00007fe266dd7b49 in _mesa_TexSubImage2D (target=<optimized out>,
level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>,
    width=<optimized out>, height=<optimized out>, format=6403, type=5121,
pixels=0x7fe218ac05e0) at main/teximage.c:3438


If we enable assert() handling in the Mesa library, this crash does not occur,
because an assert is triggered earlier:

#0  0x00007fd388fed124 in raise () from /lib64/libc.so.6
#1  0x00007fd388fee58a in abort () from /lib64/libc.so.6
#2  0x00007fd388fe5e47 in ?? () from /lib64/libc.so.6
#3  0x00007fd388fe5ef2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fd373986091 in pipe_reference_described (get_desc=<optimized out>,
reference=0x7fd35801b100, ptr=0x0)
    at gallium/auxiliary/util/u_inlines.h:82
#5  pipe_reference (reference=0x7fd35801b100, ptr=0x0) at
gallium/auxiliary/util/u_inlines.h:102
#6  pb_reference (src="" dst=0x2a260d0) at
gallium/auxiliary/pipebuffer/pb_buffer.h:241
#7  amdgpu_winsys_bo_reference (src="" dst=0x2a260d0) at
amdgpu_bo.h:116
#8  amdgpu_lookup_or_add_real_buffer (acs=0x3fea9d0, bo=0x7fd35801b100) at
amdgpu_cs.c:358
#9  0x00007fd3739863ac in amdgpu_cs_add_buffer (rcs=<optimized out>,
buf=<optimized out>, usage=10, domains=<optimized out>,
    priority=RADEON_PRIO_SAMPLER_TEXTURE) at amdgpu_cs.c:450
#10 0x00007fd3738d79fd in radeon_add_to_buffer_list
(priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ,
rbo=0x7fd358019cd0, ring=0x1eedeb8,
    rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:77
#11 radeon_add_to_buffer_list_check_mem (check_mem=false,
priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ,
rbo=0x7fd358019cd0,
    ring=0x1eedeb8, rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:114
#12 si_sampler_view_add_buffer (sctx=sctx@entry=0x1eedb60,
resource=0x7fd358019cd0, usage=usage@entry=RADEON_USAGE_READ,
    is_stencil_sampler=<optimized out>, check_mem=check_mem@entry=false) at
si_descriptors.c:316
#13 0x00007fd3738d7cb2 in si_sampler_views_begin_new_cs
(sctx=sctx@entry=0x1eedb60, views=views@entry=0x1eef360) at
si_descriptors.c:350
#14 0x00007fd3738dfd5a in si_all_descriptors_begin_new_cs
(sctx=sctx@entry=0x1eedb60) at si_descriptors.c:2019
#15 0x00007fd3738e0983 in si_begin_new_cs (ctx=ctx@entry=0x1eedb60) at
si_hw_context.c:227
#16 0x00007fd3738e14d3 in si_context_gfx_flush (context=0x1eedb60, flags=0,
fence=0x0) at si_hw_context.c:162
#17 0x00007fd37399c2a7 in r600_flush_from_st (ctx=0x1eedb60, fence=0x0,
flags=<optimized out>) at r600_pipe_common.c:381
#18 0x00007fd3735587ff in st_flush (st=st@entry=0x3e33870,
fence=fence@entry=0x0, flags=flags@entry=0) at state_tracker/st_cb_flush.c:87
#19 0x00007fd37355881e in st_glFlush (ctx=<optimized out>) at
state_tracker/st_cb_flush.c:121
#20 0x00007fd3733f7d71 in _mesa_flush (ctx=0x42cb4d0) at main/context.c:1838
#21 0x00007fd3733f8436 in _mesa_Flush () at main/context.c:1870

What happens is a race between the texture uploading thread calling
r600_texture_invalidate_storage() and the glFlush call in the rendering
thread calling radeon_add_to_buffer_list().
In radeon_add_to_buffer_list() the following code is executed:

  return rctx->ws->cs_add_buffer(
                  ring->cs, rbo->buf,
                  (enum radeon_bo_usage)(usage | RADEON_USAGE_SYNCHRONIZED),
                  rbo->domains, priority) * 4;

Meanwhile, r600_alloc_resource() executes the following code:

        /* Replace the pointer such that if res->buf wasn't NULL, it won't be
         * NULL. This should prevent crashes with multiple contexts using
         * the same buffer where one of the contexts invalidates it while
         * the others are using it. */
        old_buf = res->buf;
        res->buf = new_buf; /* should be atomic */

The rbo variable in radeon_add_to_buffer_list() and the res variable in
r600_alloc_resource() refer to the same object. While cs_add_buffer() is still
processing, the buffer it was given is no longer linked to the rbo, because it
has been swapped out in the other thread. r600_alloc_resource() then drops its
reference to the old buffer, so the reference count reaches zero, which
triggers the assert in the other thread (the assert checks the reference
count).
Without asserts enabled, the old buf object is cleaned up, which sets its
prev/next pointers to NULL and causes the crash in pb_cache_reclaim_buffer()
when it walks its bucket/cache list of buffers.
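
As an illustration only, the suspected interleaving can be replayed
deterministically on a single thread. This is a minimal sketch, not Mesa code:
toy_buffer, toy_resource, toy_ref(), toy_unref() and replay_race() are
hypothetical stand-ins for the buffer, the resource and the reference counting
helpers, and toy_ref() returns false at the point where the real code would
hit the assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the Mesa pb_buffer/r600_resource types. */
struct toy_buffer {
    int refcount;
    bool destroyed;
};

struct toy_resource {
    struct toy_buffer *buf;
};

/* Taking a reference on an object whose count is already zero is a bug.
 * Return false in that case (the real code would assert here). */
static bool toy_ref(struct toy_buffer *b)
{
    if (b->refcount == 0)
        return false;
    b->refcount++;
    return true;
}

/* Drop one reference; the last unref marks the buffer as destroyed. */
static void toy_unref(struct toy_buffer *b)
{
    if (--b->refcount == 0)
        b->destroyed = true;
}

/* Replays the interleaving step by step. Returns true if the render
 * thread's attempt to take a reference succeeded. */
static bool replay_race(void)
{
    struct toy_buffer old_buf = { .refcount = 1, .destroyed = false };
    struct toy_buffer new_buf = { .refcount = 1, .destroyed = false };
    struct toy_resource res = { .buf = &old_buf };

    /* Render thread: reads the buffer pointer, no reference taken yet. */
    struct toy_buffer *seen_by_render = res.buf;

    /* Upload thread: swaps in new storage and drops the last reference
     * to the old buffer. */
    struct toy_buffer *old = res.buf;
    res.buf = &new_buf;
    toy_unref(old);

    /* Render thread: continues with the stale pointer. */
    return toy_ref(seen_by_render);
}
```

In this replay the render thread ends up holding a pointer to a buffer whose
reference count is already zero, so replay_race() returns false. The pointer
swap itself can be atomic and the sequence still goes wrong, because reading
the pointer and taking the reference are two separate steps.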

We performed two tests:
-       Performing the texture upload on the render thread (done by a
dirty hack in our code): the stability issue is gone.
-       Making r600_can_invalidate_texture() always return false, so
that the reallocation is never done: the stability issue is gone.

These two tests prove that the race condition comes from the combination of
multi-threading and texture invalidation during texture upload.
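
For reference, the first test (uploading from the render thread) can be
approximated on the application side with a small job queue. The following is
a minimal sketch under our own assumptions, not the hack we actually used: all
names (upload_job, upload_queue, queue_push, queue_drain, count_upload) are
hypothetical, and count_upload() is only a placeholder for the real
glTexSubImage2D call.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical upload request; a real one would also carry the
 * dimensions and format needed for glTexSubImage2D. */
struct upload_job {
    int texture_id;
    const void *pixels;
};

struct upload_queue {
    pthread_mutex_t lock;
    struct upload_job jobs[QUEUE_CAP];
    size_t count;
};

static void queue_init(struct upload_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->count = 0;
}

/* Called from the upload thread. Returns 0 on success, -1 if the queue is full. */
static int queue_push(struct upload_queue *q, struct upload_job job)
{
    int rc = -1;

    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->jobs[q->count++] = job;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Called from the render thread before drawing. Copies the pending jobs
 * out under the lock, then runs do_upload for each one on the calling
 * thread. Returns the number of jobs processed. */
static size_t queue_drain(struct upload_queue *q,
                          void (*do_upload)(const struct upload_job *, void *),
                          void *user)
{
    struct upload_job local[QUEUE_CAP];
    size_t n, i;

    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (i = 0; i < n; i++)
        local[i] = q->jobs[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);

    for (i = 0; i < n; i++)
        do_upload(&local[i], user);
    return n;
}

/* Placeholder callback: a real one would bind the texture and call
 * glTexSubImage2D. Here it only sums the texture ids it was given. */
static void count_upload(const struct upload_job *job, void *user)
{
    *(int *)user += job->texture_id;
}
```

With this arrangement all GL calls that touch the texture come from one
thread, so the invalidation and the flush can no longer interleave. It only
works around the problem in the application; it does not fix the driver.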

I suspect the problem is in this check in r600_texture_transfer_map():

                        if (r600_can_invalidate_texture(rctx->screen, rtex,
                                                        usage, box))
                                r600_texture_invalidate_storage(rctx, rtex);
                        else
                                use_staging_texture = true;

Here r600_can_invalidate_texture() returns true while it should not, because
shortly afterwards the texture is used in another thread by the glFlush call.

