Hi Pranith,I have some questions if you can answer them:What in LIBC exit() routine has resulted in SIGSEGV in this case ?
#0 0x00003fffad4463b0 in _IO_unbuffer_all () at genops.c:960
#1 _IO_cleanup () at genops.c:1020
#2 0x00003fffad400d00 in __run_exit_handlers (status=<optimized out>, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true) at exit.c:95
#3 0x00003fffad400e1c in __GI_exit (status=<optimized out>) at exit.c:104
- Why the call trace always point to LIBC exit() in all these crash instances on gluster ?
- Can there be any connection between LIBC exit() crash and SIGTERM handling at early start of gluster ?
Regards,
Abhishek
On Tue, Sep 25, 2018 at 2:27 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:On Tue, Sep 25, 2018 at 2:17 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:I don't have the step to reproduce, but its a race condition where it seems cleanup_and_exit() is accessing the data structure which are not yet initialised (as gluster is in starting phase), due to SIGTERM/SIGINT is sent in between.But the crash happened inside exit() code for which will be in libc which doesn't access any data structures in glusterfs.Regards,AbhishekOn Mon, Sep 24, 2018 at 9:11 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:On Mon, Sep 24, 2018 at 5:16 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:Hi Pranith,As we know this problem is getting triggered at startup of the glusterd process when it received the SIGTERM.I think there is a problem in glusterfs code, if at startup someone sent the SIGTERM the exit handler should not be crash instead it should with some information.Could please let me know the possibility to fix it from glusterfs side?I am not as confident as you about the RC you provided. If you could give the steps to re-create, I will be happy to confirm that the RC is correct and then I will send out the fix.Regards,AbhishekOn Mon, Sep 24, 2018 at 3:12 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:On Mon, Sep 24, 2018 at 2:09 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:Could you please let me know about the bug in libc which you are talking.No, I mean, if you give the steps to reproduce, we will be able to pin point if the issue is with libc or glusterfs.On Mon, Sep 24, 2018 at 2:01 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:On Mon, Sep 24, 2018 at 1:57 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:If you see the source code in cleanup_and_exit() we are getting the SIGSEGV crash when 'exit(0)' is triggered.yes, that is what I was mentioning earlier. It is crashing in libc. So either there is a bug in libc (glusterfs actually found 1 bug so far in libc, so I wouldn't rule out that possibility) or there is something that is happening in glusterfs which is leading to the problem. Valgrind/address-sanitizer would help find where the problem could be in some cases, so before reaching out libc developers, it is better to figure out where the problem is. Do you have steps to recreate it?On Mon, Sep 24, 2018 at 1:41 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:On Mon, Sep 24, 2018 at 1:36 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:Hi Sanju,Do you have any update on this?This seems to happen while the process is dying, in libc. I am not completely sure if there is anything glusterfs is contributing to it from the bt at the moment. Do you have any steps to re-create this problem? It is probably better to run the steps with valgrind/address-sanitizer and see if it points to the problem in glusterfs.Regards,AbhishekOn Fri, Sep 21, 2018 at 4:07 PM ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:Hi Sanju,Output of 't a a bt full'(gdb) t a a bt full
Thread 7 (LWP 1743):
#0 0x00003fffa3ea7e88 in __lll_lock_wait (futex=0x0, private=0) at lowlevellock.c:43
r4 = 128
r7 = 0
arg2 = 128
r5 = 2
r8 = 1
r0 = 221
r3 = 0
r6 = 0
arg1 = 0
__err = 221
__ret = 0
#1 0x00003fffa3e9ef64 in __GI___pthread_mutex_lock (mutex=0x100272a8) at ../nptl/pthread_mutex_lock.c:81
__futex = 0x100272a8
__PRETTY_FUNCTION__ = "__pthread_mutex_lock"
type = <optimized out>
id = <optimized out>
#2 0x00003fffa3f6ce8c in _gf_msg (domain=0x3fff98006c90 "c_glusterfs-client-0", file=0x3fff9fb34de0 "client.c", function=0x3fff9fb34cd8 <__FUNCTION__.18849> "notify",
line=<optimized out>, level=<optimized out>, errnum=<optimized out>, trace=<optimized out>, msgid=114020,
fmt=0x3fff9fb35350 "parent translators are ready, attempting connect on transport") at logging.c:2058
ret = <optimized out>
msgstr = <optimized out>
ap = <optimized out>
this = 0x3fff980061f0
ctx = 0x10027010
callstr = '\000' <repeats 4095 times>
passcallstr = 0
log_inited = 0
__PRETTY_FUNCTION__ = "_gf_msg"
#3 0x00003fff9fb084ac in notify (this=0x3fff980061f0, event=<optimized out>, data="" at client.c:2116
conf = 0x3fff98056dd0
__FUNCTION__ = "notify"
#4 0x00003fffa3f68ca0 in xlator_notify (xl=0x3fff980061f0, event=<optimized out>, data="" out>) at xlator.c:491
old_THIS = 0x3fff98008c50
ret = 0
#5 0x00003fffa3f87700 in default_notify (this=0x3fff98008c50, event=<optimized out>, data="" out>) at defaults.c:2302
list = 0x3fff9800a340
#6 0x00003fff9fac922c in afr_notify (this=0x3fff98008c50, event=1, data="" data2=<optimized out>) at afr-common.c:3967
priv = 0x3fff98010050
i = <optimized out>
up_children = <optimized out>
down_children = <optimized out>
propagate = 1
had_heard_from_all = <optimized out>
---Type <return> to continue, or q <return> to quit---
have_heard_from_all = 0
idx = <optimized out>
ret = 0
call_psh = <optimized out>
input = 0x0
output = 0x0
had_quorum = <optimized out>
has_quorum = <optimized out>
__FUNCTION__ = "afr_notify"
#7 0x00003fff9fad4994 in notify (this=<optimized out>, event=<optimized out>, data="" out>) at afr.c:38
ret = -1
ap = 0x3fffa034cc58 ""
data2 = <optimized out>
#8 0x00003fffa3f68ca0 in xlator_notify (xl=0x3fff98008c50, event=<optimized out>, data="" out>) at xlator.c:491
old_THIS = 0x3fff9800a4c0
ret = 0
#9 0x00003fffa3f87700 in default_notify (this=0x3fff9800a4c0, event=<optimized out>, data="" out>) at defaults.c:2302
list = 0x3fff9800b710
#10 0x00003fff9fa6b1e4 in notify (this=<optimized out>, event=<optimized out>, data="" out>) at io-stats.c:3064
ret = 0
args = {type = IOS_DUMP_TYPE_NONE, u = {logfp = 0x0, dict = 0x0}}
op = 0
list_cnt = 0
throughput = 0
time = 0
is_peek = _gf_false
ap = 0x3fffa034ce68 ""
__FUNCTION__ = "notify"
#11 0x00003fffa3f68ca0 in xlator_notify (xl=0x3fff9800a4c0, event=<optimized out>, data="" out>) at xlator.c:491
old_THIS = 0x3fffa402d290 <global_xlator>
ret = 0
#12 0x00003fffa3fbd560 in glusterfs_graph_parent_up (graph=<optimized out>) at graph.c:440
trav = 0x3fff9800a4c0
ret = <optimized out>
#13 0x00003fffa3fbdb90 in glusterfs_graph_activate (graph=0x3fff98000af0, ctx=0x10027010) at graph.c:688
ret = <optimized out>
__FUNCTION__ = "glusterfs_graph_activate"
#14 0x000000001000a49c in glusterfs_process_volfp (ctx=0x10027010, fp=0x3fff98001cd0) at glusterfsd.c:2221
graph = 0x3fff98000af0
ret = <optimized out>
trav = <optimized out>
__FUNCTION__ = <error reading variable __FUNCTION__ (Cannot access memory at address 0x10010ec0)>
#15 0x000000001000fd08 in mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x3fffa2bea06c) at glusterfsd-mgmt.c:1561
rsp = {op_ret = 1059, op_errno = 0,
spec = 0x3fff980018a0 "volume c_glusterfs-client-0\n type protocol/client\n option password 5fd8d83d-99f3-4630-97c6-965d7a8ead62\n option username e65687aa-e135-445e-8778-48bb8fb19640\n option transport-type tcp\n "..., xdata = {xdata_len = 0, xdata_val = 0x0}}
---Type <return> to continue, or q <return> to quit---
frame = 0x3fffa2bea06c
ctx = 0x10027010
ret = <optimized out>
size = 1059
tmpfp = 0x3fff98001cd0
volfilebuf = 0x0
__FUNCTION__ = <error reading variable __FUNCTION__ (Cannot access memory at address 0x10013570)>
#16 0x00003fffa3f21ec4 in rpc_clnt_handle_reply (clnt=0x10089020, pollin=0x3fff98001760) at rpc-clnt.c:775
conn = 0x10089050
saved_frame = <optimized out>
ret = <optimized out>
req = 0x1008931c
xid = 1
__FUNCTION__ = "rpc_clnt_handle_reply"
#17 0x00003fffa3f223d0 in rpc_clnt_notify (trans=<optimized out>, mydata=0x10089050, event=<optimized out>, data="" out>) at rpc-clnt.c:933
conn = 0x10089050
clnt = <optimized out>
ret = -1
req_info = 0x0
pollin = <optimized out>
clnt_mydata = 0x0
old_THIS = 0x3fffa402d290 <global_xlator>
__FUNCTION__ = "rpc_clnt_notify"
#18 0x00003fffa3f1d4fc in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data="" out>) at rpc-transport.c:546
ret = -1
__FUNCTION__ = "rpc_transport_notify"
#19 0x00003fffa0401d44 in socket_event_poll_in (this=this@entry=0x1008ab80) at socket.c:2236
ret = <optimized out>
pollin = 0x3fff98001760
priv = 0x1008b820
#20 0x00003fffa040489c in socket_event_handler (fd=<optimized out>, idx=<optimized out>, data="" poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>)
at socket.c:2349
this = 0x1008ab80
priv = 0x1008b820
ret = <optimized out>
__FUNCTION__ = "socket_event_handler"
#21 0x00003fffa3fe2874 in event_dispatch_epoll_handler (event=0x3fffa034d6a0, event_pool=0x10045bc0) at event-epoll.c:575
handler = @0x3fffa041f620: 0x3fffa04046f0 <socket_event_handler>
gen = 1
slot = 0x1007cd80
data = "" out>
ret = -1
fd = 9
ev_data = 0x3fffa034d6a8
idx = 1
#22 event_dispatch_epoll_worker (data="" at event-epoll.c:678
---Type <return> to continue, or q <return> to quit---
event = {events = 1, data = "" = 0x100000001, fd = 1, u32 = 1, u64 = 4294967297}}
ret = <optimized out>
ev_data = 0x1008bd50
event_pool = 0x10045bc0
myindex = <optimized out>
timetodie = 0
__FUNCTION__ = "event_dispatch_epoll_worker"
#23 0x00003fffa3e9bb30 in start_thread (arg=0x3fffa034e160) at pthread_create.c:462
pd = 0x3fffa034e160
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {2380233324717718430, 70367199403008, 2380233324703897146, 0, 0, 70367128645632, 70367137030688, 8388608, 70367199363104, 269008208,
70368094386592, 70367199388632, 70367200825640, 3, 0, 70367199388648, 70368094386240, 70368094386296, 4001536, 70367199364120, 70367137027904, -3187653596,
0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = "" = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#24 0x00003fffa3de60fc in .__clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:96
No locals.
Thread 6 (LWP 1735):
#0 0x00003fffa3ea3ccc in __pthread_cond_timedwait (cond=0x10059a98, mutex=0x10059a70, abstime=0x3fffa141f670) at pthread_cond_timedwait.c:198
r4 = 393
r7 = 0
arg5 = 0
arg2 = <optimized out>
r5 = 2
r8 = 4294967295
arg6 = 4294967295
arg3 = 2
r0 = 221
r3 = 516
r6 = 70367154665072
arg4 = 70367154665072
arg1 = 268802716
__err = <optimized out>
__ret = <optimized out>
futex_val = 2
buffer = {__routine = @0x3fffa3ec0b50: 0x3fffa3ea3400 <__condvar_cleanup>, __arg = 0x3fffa141f540, __canceltype = 0, __prev = 0x0}
cbuffer = {oldtype = 0, cond = 0x10059a98, mutex = 0x10059a70, bc_seq = 0}
result = 0
pshared = 0
pi_flag = 0
err = <optimized out>
val = <optimized out>
---Type <return> to continue, or q <return> to quit---
seq = 0
#1 0x00003fffa3fc0e74 in syncenv_task (proc=0x10053eb0) at syncop.c:607
env = 0x10053eb0
task = 0x0
sleep_till = {tv_sec = 1536845230, tv_nsec = 0}
ret = <optimized out>
#2 0x00003fffa3fc1cdc in syncenv_processor (thdata=0x10053eb0) at syncop.c:699
env = 0x10053eb0
proc = 0x10053eb0
task = <optimized out>
#3 0x00003fffa3e9bb30 in start_thread (arg=0x3fffa1420160) at pthread_create.c:462
pd = 0x3fffa1420160
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {2380233324731669406, 70367199403008, 2380233324703897146, 0, 0, 70367146283008, 70367154668064, 8388608, 70367199363104, 268779184,
268779184, 70367199388632, 70367200820192, 3, 0, 70367199388648, 70368094386080, 70368094386136, 4001536, 70367199364120, 70367154665280, -3187653564,
0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = "" = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#4 0x00003fffa3de60fc in .__clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:96
No locals.
Thread 5 (LWP 1746):
#0 0x00003fffa3ea7e38 in __lll_lock_wait (futex=0x100272a8, private=<optimized out>) at lowlevellock.c:46
r4 = 128
r7 = 0
r5 = 2
r8 = 1
arg3 = 2
r0 = 221
r3 = 512
r6 = 0
arg4 = 0
arg1 = 268595880
__err = <optimized out>
__ret = <optimized out>
#1 0x00003fffa3e9ef64 in __GI___pthread_mutex_lock (mutex=0x100272a8) at ../nptl/pthread_mutex_lock.c:81
__futex = 0x100272a8
<p class="m_-6459907265332461971m_-5579429898042230588m_3924698905562214407m_-746409795816116485m_5332120683958399093m_6916172682174822483m_-2822
--
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx https://lists.gluster.org/mailman/listinfo/gluster-users