We tested with ASAN and without the fix at [1], and it consistently crashes in the md-cache xlator when brick mux is enabled.

On 09/28/2018 03:50 PM, FNU Raghavendra Manjunath wrote:
>
> I was looking into the issue, and this is what I could find while working with Shyam.
>
> There are 2 things here.
>
> 1) The multiplexed brick process for the snapshot(s) getting the client volfile (I suspect it happened when the restore operation was performed).
> 2) Memory corruption happening while the multiplexed brick process is building the graph (for the client volfile it got above).
>
> I have been able to reproduce the issue on my local machine once, when I ran the test case tests/bugs/snapshot/bug-1275616.t.
>
> Upon comparison, we found that the backtrace of the core I got and the core generated in the regression runs were similar. In fact, the victim information Shyam mentioned before is also similar in the core that I was able to get.
>
> On top of that, when the brick process was run with valgrind, it reported the following memory corruption:
>
> ==31257== Conditional jump or move depends on uninitialised value(s)
> ==31257==    at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)
> ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
> ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
> ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
> ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
> ==31257==    by 0x518084E: rpc_transport_notify (rpc-transport.c:525)
> ==31257==    by 0x123273DF: socket_event_poll_in (socket.c:2504)
> ==31257==  Uninitialised value was created by a heap allocation
> ==31257==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
> ==31257==    by 0x4E9F58E: __gf_malloc (mem-pool.c:136)
> ==31257==    by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)
> ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
> ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
> ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
> ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
>
> Based on the above observations, I think the below patch by Shyam should fix the crash. [1]
> https://review.gluster.org/#/c/glusterfs/+/21299/
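The valgrind report above points at a well-known bug class: a GF_MALLOC'd buffer whose contents are read before they are fully initialised, for example a string buffer that is appended to before anyone writes a terminating NUL into it. The snippet below is only a minimal, hypothetical sketch of that bug class and of the usual fix (calloc, or explicit termination before the first append); it is not the actual md-cache code and not the patch at [1], and the xattr names are simply borrowed from the memory dump further down.

/* Hypothetical illustration only -- not the actual md-cache code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    const char *xattrs[] = { "security.capability", "security.ima",
                             "user.swift.metadata" };
    size_t n = sizeof(xattrs) / sizeof(xattrs[0]);
    size_t len = 0;
    size_t i;

    for (i = 0; i < n; i++)
        len += strlen(xattrs[i]) + 1; /* ',' between names, '\0' at the end */

    /* Bug class: malloc() returns uninitialised memory, so the first
     * strcat() below would run strlen() over garbage (the "uninitialised
     * value" valgrind flags) and could append far beyond the buffer,
     * corrupting neighbouring heap chunks. */
    char *list = malloc(len);
    if (!list)
        return 1;

    /* Typical fix: allocate with calloc() instead, or terminate the
     * buffer explicitly before the first append. */
    list[0] = '\0';

    for (i = 0; i < n; i++) {
        strcat(list, xattrs[i]);
        if (i + 1 < n)
            strcat(list, ",");
    }

    printf("%s\n", list);
    free(list);
    return 0;
}

Without the explicit termination, valgrind should flag an uninitialised read inside strcat()/strlen(), which has the same shape as the report above; with it (or with calloc()) the complaint goes away.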
>
> But I am still trying to understand why a brick process should get a client volfile (i.e. the 1st issue mentioned above).
>
> Regards,
> Raghavendra
>
> On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
>
> On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> > 2. Testing dashboard to maintain release health (new, thanks Nigel)
> >    - Dashboard at [2]
> >    - We already have 3 failures here as follows, needs attention from appropriate *maintainers*,
> >    (a) https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText
> >        - Failed with core: ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t
> >    (b) https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText
> >        - Failed with core: ./tests/bugs/snapshot/bug-1275616.t
> >        - Also test ./tests/bugs/glusterd/validating-server-quorum.t had to be retried
>
> I was looking at the cores from the above 2 instances; the one in job 873 has been a typical pattern, where malloc fails because there is internal header corruption in the free bins.
>
> When examining the victim chunk that would have been allocated, it is often carrying an incorrect size and other bogus magic information. If the data in the victim is investigated, it looks like a volfile.
>
> With the crash in 871, I thought there may be a point where this is detected earlier, but I was not able to make headway on that.
>
> So, what could be corrupting this memory, and is it happening when the graph is being processed? Can we run this with ASAN or such? (I have not tried, but I need pointers if anyone has run tests with ASAN.)
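On the ASAN question above: the general approach is to rebuild the binaries with -fsanitize=address (GCC or Clang), which makes the very first out-of-bounds write abort with a stack trace at the faulting instruction, instead of the delayed "corrupted unsorted chunks" / "malloc(): memory corruption" aborts glibc gives us in the cores below. A minimal, self-contained illustration; the file name, flags, and the idea of injecting them via CFLAGS/LDFLAGS at configure time are assumptions on my part, not the project's documented recipe.

/* asan-demo.c -- hypothetical example of what ASAN catches.
 * Build and run (assuming GCC or Clang):
 *     cc -g -O1 -fsanitize=address -fno-omit-frame-pointer asan-demo.c -o asan-demo && ./asan-demo
 * For glusterfsd itself, the same flags can be injected at configure time
 * through CFLAGS/LDFLAGS; the exact configure invocation depends on the
 * tree, so treat that as a sketch. */
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char *buf = malloc(16);
    if (!buf)
        return 1;

    /* One byte past the end: glibc's malloc may only notice this much
     * later (as a delayed abort while walking its free bins), whereas
     * ASAN reports a heap-buffer-overflow with a stack trace at this
     * exact write. */
    memset(buf, 'A', 17);

    free(buf);
    return 0;
}

Run with the same flags on glusterfsd, this kind of instrumentation should turn the corruption seen in bug-1275616.t into an immediate report naming the writer, which is usually the fastest way to find what is scribbling over the chunk that ends up holding volfile data.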
>
> Here is the (brief) stack analysis of the core in 873:
> NOTE: we need to start avoiding flushing the logs when we are dumping core, as that leads to more memory allocations and causes a sort of double fault in such cases.
>
> Core was generated by `/build/install/sbin/glusterfsd -s builder101.cloud.gluster.org --volfile-id /sn'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f23cf591968 in __GI_abort () at abort.c:90
> #2  0x00007f23cf5d2d37 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> #3  0x00007f23cf5db499 in malloc_printerr (ar_ptr=0x7f23bc000020, ptr=<optimized out>, str=0x7f23cf6e4ea8 "free(): corrupted unsorted chunks", action=3) at malloc.c:5025
> #4  _int_free (av=0x7f23bc000020, p=<optimized out>, have_lock=0) at malloc.c:3847
> #5  0x00007f23d0f7c6e4 in __gf_free (free_ptr=0x7f23bc0a56a0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.c:356
> #6  0x00007f23d0f41821 in log_buf_destroy (buf=0x7f23bc0a5568) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:358
> #7  0x00007f23d0f44e55 in gf_log_flush_list (copy=0x7f23c404a290, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1739
> #8  0x00007f23d0f45081 in gf_log_flush_extra_msgs (ctx=0x1ff6010, new=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1807
> #9  0x00007f23d0f4162d in gf_log_set_log_buf_size (buf_size=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:290
> #10 0x00007f23d0f41acc in gf_log_disable_suppression_before_exit (ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:444
> #11 0x00007f23d0f4c027 in gf_print_trace (signum=6, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/common-utils.c:922
> #12 0x000000000040a84a in glusterfsd_print_trace (signum=6) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2316
> #13 <signal handler called>
> #14 0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #15 0x00007f23cf591968 in __GI_abort () at abort.c:90
> #16 0x00007f23cf5d2d37 in __libc_message (do_abort=2, fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> #17 0x00007f23cf5dcc86 in malloc_printerr (ar_ptr=0x7f23bc000020, ptr=0x7f23bc003cd0, str=0x7f23cf6e245b "malloc(): memory corruption", action=<optimized out>) at malloc.c:5025
> #18 _int_malloc (av=av@entry=0x7f23bc000020, bytes=bytes@entry=15664) at malloc.c:3473
> #19 0x00007f23cf5df84c in __GI___libc_malloc (bytes=15664) at malloc.c:2899
> #20 0x00007f23d0f3bbbf in __gf_default_malloc (size=15664) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.h:106
> #21 0x00007f23d0f3f02f in xlator_mem_acct_init (xl=0x7f23bc082b20, num_types=163) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:800
> #22 0x00007f23b90a37bf in mem_acct_init (this=0x7f23bc082b20) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/performance/open-behind/src/open-behind.c:1189
> #23 0x00007f23d0f3ebe8 in xlator_init (xl=0x7f23bc082b20) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:700
> #24 0x00007f23d0f8fb5f in glusterfs_graph_init (graph=0x7f23bc010570) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:359
> #25 0x00007f23d0f907ac in glusterfs_graph_activate (graph=0x7f23bc010570, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:722
> #26 0x000000000040af89 in glusterfs_process_volfp (ctx=0x1ff6010, fp=0x7f23bc00b0a0) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2528
> #27 0x000000000041094c in mgmt_getspec_cbk (req=0x7f23a4004f78, iov=0x7f23a4004fb8, count=1, myframe=0x7f23a4002b88) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd-mgmt.c:2076
> #28 0x00007f23d0d0617d in rpc_clnt_handle_reply (clnt=0x2077910, pollin=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:755
> #29 0x00007f23d0d066ad in rpc_clnt_notify (trans=0x2077c70, mydata=0x2077940, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:923
> #30 0x00007f23d0d02895 in rpc_transport_notify (this=0x2077c70, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-transport.c:525
> #31 0x00007f23c5b143ff in socket_event_poll_in (this=0x2077c70, notify_handled=true) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2504
> #32 0x00007f23c5b153e0 in socket_event_handler (fd=9, idx=1, gen=1, data=0x2077c70, poll_in=1, poll_out=0, poll_err=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2905
> #33 0x00007f23d0fbd3bc in event_dispatch_epoll_handler (event_pool=0x202dc40, event=0x7f23c404bea0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:591
> #34 0x00007f23d0fbd6b5 in event_dispatch_epoll_worker (data=0x2079470) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:668
> #35 0x00007f23cfd8fe25 in start_thread (arg=0x7f23c404c700) at pthread_create.c:308
> #36 0x00007f23cf658bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>
> (gdb) p victim
> $1 = (struct malloc_chunk *) 0x7f23bc003cc0
>
> (gdb) x/16c (char *)victim - 16
> 0x7f23bc003cb0: 54 '6'   48 '0'   57 '9'   53 '5'   13 '\r'  -16 '\360'  -83 '\255'  -70 '\272'
> 0x7f23bc003cb8: 56 '8'   57 '9'   51 '3'   48 '0'   50 '2'   99 'c'   99 'c'   55 '7'
> (gdb)
> 0x7f23bc003cc0: 50 '2'   52 '4'   47 '/'   98 'b'   114 'r'  105 'i'  99 'c'   107 'k'
> 0x7f23bc003cc8: 33 '!'   4 '\004' 115 's'  101 'e'  99 'c'   117 'u'  114 'r'  105 'i'
> (gdb)
> 0x7f23bc003cd0: 116 't'  121 'y'  46 '.'   99 'c'   97 'a'   112 'p'  97 'a'   98 'b'
> 0x7f23bc003cd8: 105 'i'  108 'l'  105 'i'  116 't'  121 'y'  44 ','   115 's'  101 'e'
> (gdb)
> 0x7f23bc003ce0: 99 'c'   117 'u'  114 'r'  105 'i'  116 't'  121 'y'  46 '.'   105 'i'
> 0x7f23bc003ce8: 109 'm'  97 'a'   44 ','   117 'u'  115 's'  101 'e'  114 'r'  46 '.'
> (gdb)
> 0x7f23bc003cf0: 115 's'  119 'w'  105 'i'  102 'f'  116 't'  46 '.'   109 'm'  101 'e'
> 0x7f23bc003cf8: 116 't'  97 'a'   100 'd'  97 'a'   116 't'  97 'a'   44 ','   0 '\000'

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel