Re: [Gluster-Maintainers] Memory overwrites due to processing vol files???

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




I was looking into the issue and  this is what I could find while working with shyam.

There are 2 things here.

1) The multiplexed brick process for the snapshot(s) getting the client volfile (I suspect, it happened
     when restore operation was performed).
2) Memory corruption happening while the multiplexed brick process is building the graph (for the client
     volfile it got above)

I have been able to reproduce the issue in my local computer once, when I ran the testcase tests/bugs/snapshot/bug-1275616.t

Upon comparison, we found that the backtrace of the core I got and the core generated in the regression runs was similar.
In fact, the victim information shyam mentioned before, is also similar in the core that I was able to get.  

On top of that, when the brick process was run with valgrind, it reported following memory corruption

==31257== Conditional jump or move depends on uninitialised value(s)
==31257==    at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)
==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
==31257==    by 0x518084E: rpc_transport_notify (rpc-transport.c:525)
==31257==    by 0x123273DF: socket_event_poll_in (socket.c:2504)
==31257==  Uninitialised value was created by a heap allocation
==31257==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==31257==    by 0x4E9F58E: __gf_malloc (mem-pool.c:136)
==31257==    by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)
==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)

Based on the above observations, I think the below patch  by Shyam should fix the crash.

But, I am still trying understand, why a brick process should get a client volfile (i.e. the 1st issue mentioned above). 

Regards,
Raghavendra

On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> 2. Testing dashboard to maintain release health (new, thanks Nigel)
>   - Dashboard at [2]
>   - We already have 3 failures here as follows, needs attention from
> appropriate *maintainers*,
>     (a)
> https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText
>       - Failed with core: ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t
>     (b)
> https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText
>       - Failed with core: ./tests/bugs/snapshot/bug-1275616.t
>       - Also test ./tests/bugs/glusterd/validating-server-quorum.t had to be
> retried

I was looking at the cores from the above 2 instances, the one in job
873 is been a typical pattern, where malloc fails as there is internal
header corruption in the free bins.

When examining the victim that would have been allocated, it is often
carrying incorrect size and other magic information. If the data in
victim is investigated it looks like a volfile.

With the crash in 871, I thought there maybe a point where this is
detected earlier, but not able to make headway in the same.

So, what could be corrupting this memory and is it when the graph is
being processed? Can we run this with ASAN or such (I have not tried,
but need pointers if anyone has run tests with ASAN).

Here is the (brief) stack analysis of the core in 873:
NOTE: we need to start avoiding flushing the logs when we are dumping
core, as that leads to more memory allocations and causes a sort of
double fault in such cases.

Core was generated by `/build/install/sbin/glusterfsd -s
builder101.cloud.gluster.org --volfile-id /sn'.
Program terminated with signal 6, Aborted.
#0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f23cf591968 in __GI_abort () at abort.c:90
#2  0x00007f23cf5d2d37 in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f23cf5db499 in malloc_printerr (ar_ptr=0x7f23bc000020,
ptr=<optimized out>, str=0x7f23cf6e4ea8 "free(): corrupted unsorted
chunks", action="" at malloc.c:5025
#4  _int_free (av=0x7f23bc000020, p=<optimized out>, have_lock=0) at
malloc.c:3847
#5  0x00007f23d0f7c6e4 in __gf_free (free_ptr=0x7f23bc0a56a0) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.c:356
#6  0x00007f23d0f41821 in log_buf_destroy (buf=0x7f23bc0a5568) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:358
#7  0x00007f23d0f44e55 in gf_log_flush_list (copy=0x7f23c404a290,
ctx=0x1ff6010) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1739
#8  0x00007f23d0f45081 in gf_log_flush_extra_msgs (ctx=0x1ff6010, new=0)
at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1807
#9  0x00007f23d0f4162d in gf_log_set_log_buf_size (buf_size=0) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:290
#10 0x00007f23d0f41acc in gf_log_disable_suppression_before_exit
(ctx=0x1ff6010) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:444
#11 0x00007f23d0f4c027 in gf_print_trace (signum=6, ctx=0x1ff6010) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/common-utils.c:922
#12 0x000000000040a84a in glusterfsd_print_trace (signum=6) at
/home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2316
#13 <signal handler called>
#14 0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#15 0x00007f23cf591968 in __GI_abort () at abort.c:90
#16 0x00007f23cf5d2d37 in __libc_message (do_abort=2,
fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/unix/sysv/linux/libc_fatal.c:196
#17 0x00007f23cf5dcc86 in malloc_printerr (ar_ptr=0x7f23bc000020,
ptr=0x7f23bc003cd0, str=0x7f23cf6e245b "malloc(): memory corruption",
action="" out>) at malloc.c:5025
#18 _int_malloc (av=av@entry=0x7f23bc000020, bytes=bytes@entry=15664) at
malloc.c:3473
#19 0x00007f23cf5df84c in __GI___libc_malloc (bytes=15664) at malloc.c:2899
#20 0x00007f23d0f3bbbf in __gf_default_malloc (size=15664) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.h:106
#21 0x00007f23d0f3f02f in xlator_mem_acct_init (xl=0x7f23bc082b20,
num_types=163) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:800
#22 0x00007f23b90a37bf in mem_acct_init (this=0x7f23bc082b20) at
/home/jenkins/root/workspace/regression-test-with-multiplex/xlators/performance/open-behind/src/open-behind.c:1189
#23 0x00007f23d0f3ebe8 in xlator_init (xl=0x7f23bc082b20) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:700
#24 0x00007f23d0f8fb5f in glusterfs_graph_init (graph=0x7f23bc010570) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:359
#25 0x00007f23d0f907ac in glusterfs_graph_activate
(graph=0x7f23bc010570, ctx=0x1ff6010) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:722
#26 0x000000000040af89 in glusterfs_process_volfp (ctx=0x1ff6010,
fp=0x7f23bc00b0a0) at
/home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2528
#27 0x000000000041094c in mgmt_getspec_cbk (req=0x7f23a4004f78,
iov=0x7f23a4004fb8, count=1, myframe=0x7f23a4002b88)
    at
/home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd-mgmt.c:2076
#28 0x00007f23d0d0617d in rpc_clnt_handle_reply (clnt=0x2077910,
pollin=0x7f23bc001e80) at
/home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:755
#29 0x00007f23d0d066ad in rpc_clnt_notify (trans=0x2077c70,
mydata=0x2077940, event=RPC_TRANSPORT_MSG_RECEIVED, data="">     at
/home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:923
#30 0x00007f23d0d02895 in rpc_transport_notify (this=0x2077c70,
event=RPC_TRANSPORT_MSG_RECEIVED, data="">     at
/home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-transport.c:525
#31 0x00007f23c5b143ff in socket_event_poll_in (this=0x2077c70,
notify_handled=true) at
/home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2504
#32 0x00007f23c5b153e0 in socket_event_handler (fd=9, idx=1, gen=1,
data="" poll_in=1, poll_out=0, poll_err=0)
    at
/home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2905
#33 0x00007f23d0fbd3bc in event_dispatch_epoll_handler
(event_pool=0x202dc40, event=0x7f23c404bea0) at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:591
#34 0x00007f23d0fbd6b5 in event_dispatch_epoll_worker (data=""> at
/home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:668
#35 0x00007f23cfd8fe25 in start_thread (arg=0x7f23c404c700) at
pthread_create.c:308
#36 0x00007f23cf658bad in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) p victim
$1 = (struct malloc_chunk *) 0x7f23bc003cc0

(gdb) x/16c (char *)victim - 16
0x7f23bc003cb0: 54 '6'  48 '0'  57 '9'  53 '5'  13 '\r' -16 '\360'      -83
'\255'  -70 '\272'
0x7f23bc003cb8: 56 '8'  57 '9'  51 '3'  48 '0'  50 '2'  99 'c'  99 'c'  55 '7'
(gdb)
0x7f23bc003cc0: 50 '2'  52 '4'  47 '/'  98 'b'  114 'r' 105 'i' 99 'c'  107 'k'
0x7f23bc003cc8: 33 '!'  4 '\004'        115 's' 101 'e' 99 'c'  117 'u' 114 'r'
105 'i'
(gdb)
0x7f23bc003cd0: 116 't' 121 'y' 46 '.'  99 'c'  97 'a'  112 'p' 97 'a'  98 'b'
0x7f23bc003cd8: 105 'i' 108 'l' 105 'i' 116 't' 121 'y' 44 ','  115 's'
101 'e'
(gdb)
0x7f23bc003ce0: 99 'c'  117 'u' 114 'r' 105 'i' 116 't' 121 'y' 46 '.'
105 'i'
0x7f23bc003ce8: 109 'm' 97 'a'  44 ','  117 'u' 115 's' 101 'e' 114 'r' 46 '.'
(gdb)
0x7f23bc003cf0: 115 's' 119 'w' 105 'i' 102 'f' 116 't' 46 '.'  109 'm'
101 'e'
0x7f23bc003cf8: 116 't' 97 'a'  100 'd' 97 'a'  116 't' 97 'a'  44 ','  0 '\000'
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux