Re: Spurious regression failure analysis from runs over the weekend

On 02/23/2015 01:58 PM, Justin Clift wrote:
Short version:

75% of the Jenkins regression tests we run in Rackspace (on
glusterfs master branch) fail from spurious errors.

This is why we're having capacity problems with our Jenkins
slave nodes... we need to run our tests 4x for each CR just
to get a potentially valid result. :/


Longer version:

Ran 20 regression test runs on git master HEAD over the
weekend, to better understand our spurious failure situation.

75% of the regression runs failed in various ways.  Oops. ;)

The failures:

   * 5 x tests/bugs/fuse/bug-1126048.t
         Failed test:  10

   * 3 x tests/bugs/quota/bug-1087198.t
         Failed test:  18

   * 3 x tests/performance/open-behind.t
         Failed test:  17

   * 2 x tests/bugs/geo-replication/bug-877293.t
         Failed test:  11

   * 2 x tests/basic/afr/split-brain-heal-info.t
         Failed tests:  20-41

   * 1 x tests/bugs/distribute/bug-1117851.t
         Failed test:  15

   * 1 x tests/basic/uss.t
         Failed test:  26

   * 1 x hung on tests/bugs/posix/bug-1113960.t

         No idea which test it was on.  Left it running
         several hours, then killed the VM along with the rest.

4 of the regression runs also created coredumps.  Uploaded the
archived_builds and logs here:

     http://mirror.salasaga.org/gluster/

(are those useful?)

Yes, these are useful: they contain a very similar crash in each of the cores, so we could be looking at a single problem to fix here. Here is a short update on the cores. At a broad level, cleanup_and_exit is racing with a list deletion, as seen in the following two threads.

Those interested can download and extract the tarballs from the link provided (for example, http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2), then run the following from the root of the extracted tarball to look at the details from the core dump:

     gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd

Core was generated by `/build/install/sbin/glusterfsd -s bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
88      /root/glusterfs/libglusterfs/src/list.h: No such file or directory.

1) The list deletion that generates the core:

(gdb) bt
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
#1 0x00007fd84a1352ae in pl_inodelk_client_cleanup (this=0x7fd84400b7e0, ctx=0x7fd834000b50) at /root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2 0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0, client=0x7fd83c002fd0) at /root/glusterfs/xlators/features/locks/src/posix.c:2563
#3 0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0) at /root/glusterfs/libglusterfs/src/client_t.c:393
#4 0x00007fd849262296 in server_connection_cleanup (this=0x7fd844014350, client=0x7fd83c002fd0, flags=3) at /root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5 0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70, xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/xlators/protocol/server/src/server.c:532
#6 0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70, trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7 0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440, mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8 0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9 0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5, data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler (event_pool=0x1d835d0, event=0x7fd84b5a9e70) at /root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790) at /root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6

2) Parallel cleanup in progress (see frame #12, cleanup_and_exit):

Thread 12 (LWP 28010):
#0 0x00007f8620a31f48 in _nss_files_parse_servent () from ./lib64/libnss_files.so.2
#1 0x00007f8620a326b0 in _nss_files_getservbyport_r () from ./lib64/libnss_files.so.2
#2 0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from ./lib64/libc.so.6
#3 0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4 0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860 "bulkregression16.localdomain", port=24007, family=2, dnscache=0x1715748, addr_info=0x7f861b662930) at /root/glusterfs/libglusterfs/src/common-utils.c:240
#5 0x00007f86220594c3 in af_inet_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6 0x00007f8622059eba in socket_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8, sa_family=0x7f861b662aa6) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7 0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8 0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9 0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at /root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60, frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0 <xdr_pmap_signout_req@plt>) at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6

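To make the race a bit more concrete, below is a minimal, self-contained C sketch (NOT the actual GlusterFS sources) of the pattern the two backtraces suggest: one thread unlinks entries from a shared kernel-style list (roughly the pl_inodelk_client_cleanup path) while another thread drains the same list on its way out (roughly the cleanup_and_exit path). All the names in it (lock_entry, drain_locks, locks_mutex, ...) are invented stand-ins, and the mutex is just one way of serializing the two paths; whether that is the right fix for the real locks xlator still needs to be confirmed against the code.

/*
 * Illustrative sketch only, not GlusterFS code.
 * Build with: gcc -pthread race-sketch.c   (hypothetical file name)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct list_head {
        struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
        h->next = h;
        h->prev = h;
}

static void list_add(struct list_head *item, struct list_head *head)
{
        item->next = head->next;
        item->prev = head;
        head->next->prev = item;
        head->next = item;
}

/*
 * The unlink dereferences both neighbours.  If a racing thread has already
 * unlinked and freed one of them, old->prev or old->next points at freed
 * memory and this can fault -- the kind of failure frame #0 shows at
 * list.h:88 in the core.
 */
static void list_del_init(struct list_head *old)
{
        old->prev->next = old->next;
        old->next->prev = old->prev;
        INIT_LIST_HEAD(old);
}

struct lock_entry {
        struct list_head list;  /* first member, so the cast below is valid */
        int              id;
};

static struct list_head locks;                                   /* shared lock list  */
static pthread_mutex_t  locks_mutex = PTHREAD_MUTEX_INITIALIZER; /* serializes access */

static void drain_locks(void)
{
        /* Without this lock, the two drains interleave and one of them
         * ends up calling list_del_init() on already-freed neighbours. */
        pthread_mutex_lock(&locks_mutex);
        while (locks.next != &locks) {
                struct lock_entry *e = (struct lock_entry *)locks.next;
                list_del_init(&e->list);
                free(e);
        }
        pthread_mutex_unlock(&locks_mutex);
}

/* Rough analogue of the client-disconnect cleanup path. */
static void *disconnect_cleanup(void *arg)
{
        (void)arg;
        drain_locks();
        return NULL;
}

/* Rough analogue of the process-exit teardown path. */
static void *process_teardown(void *arg)
{
        (void)arg;
        drain_locks();
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        int       i;

        INIT_LIST_HEAD(&locks);
        for (i = 0; i < 100000; i++) {
                struct lock_entry *e = malloc(sizeof(*e));
                e->id = i;
                list_add(&e->list, &locks);
        }

        pthread_create(&t1, NULL, disconnect_cleanup, NULL);
        pthread_create(&t2, NULL, process_teardown, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("list drained; no thread touched freed nodes\n");
        return 0;
}

The only point of the sketch is that list_del_init() touches both neighbours of the node being removed, so any path that can free list nodes concurrently with the disconnect cleanup has to be serialized against it (or otherwise prevented from running at the same time).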

We should probably concentrate on fixing the most common
spurious failures soon, and look into the less common ones
later on.

I'll do some runs on release-3.6 soon too, as I suspect that'll
be useful.

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
