Re: glusterd crashing

Sorry for the delay. Here is what's installed:
# rpm -qa | grep gluster
glusterfs-geo-replication-3.7.4-2.el6.x86_64
glusterfs-client-xlators-3.7.4-2.el6.x86_64
glusterfs-3.7.4-2.el6.x86_64
glusterfs-libs-3.7.4-2.el6.x86_64
glusterfs-api-3.7.4-2.el6.x86_64
glusterfs-fuse-3.7.4-2.el6.x86_64
glusterfs-server-3.7.4-2.el6.x86_64
glusterfs-cli-3.7.4-2.el6.x86_64

The cmd_history.log file is attached. 
In gluster.log I have filtered out a bunch of lines like the one below to make it more readable. One node was down for multiple days for maintenance, and another went down during that time because of a hardware failure.
[2015-10-01 00:16:09.643631] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-gv0-client-0: remote operation failed. Path: <gfid:31f17f8c-6c96-4440-88c0-f813b3c8d364> (31f17f8c-6c96-4440-88c0-f813b3c8d364) [No such file or directory]

I also filtered out a boatload of self-heal lines like these two:
[2015-10-01 15:14:14.851015] I [MSGID: 108026] [afr-self-heal-metadata.c:56:__afr_selfheal_metadata_do] 0-gv0-replicate-0: performing metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3
[2015-10-01 15:14:14.856392] I [MSGID: 108026] [afr-self-heal-common.c:651:afr_log_selfheal] 0-gv0-replicate-0: Completed metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3. source=0 sinks=1
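
For reference, the two filters used in the command below can be combined into a single grep call. This is only a sketch: the helper name filter_shd_log is hypothetical (it is not part of GlusterFS), and it assumes a grep that accepts several -e patterns, as GNU grep does.

```shell
# Hypothetical helper, not part of GlusterFS: drop the two noisy message
# families described above and print everything else.
# With no file argument it reads standard input.
filter_shd_log() {
    grep -v -e 'remote operation failed' -e 'self-heal' "$@"
}
```

Running it as `filter_shd_log glustershd.log` should print the same lines as the two chained `grep -v` calls, since a line is dropped if it matches either pattern.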


[root@eapps-gluster01 glusterfs]# cat glustershd.log |grep -v 'remote operation failed' |grep -v 'self-heal'
[2015-09-27 08:46:56.893125] E [rpc-clnt.c:201:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x6 sent = 2015-09-27 08:16:51.742731. timeout = 1800 for 127.0.0.1:24007
[2015-09-28 12:54:17.524924] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (Connection reset by peer)
[2015-09-28 12:54:27.844374] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-09-28 12:57:03.485027] W [socket.c:588:__socket_rwv] 0-gv0-client-2: readv on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-09-28 12:57:05.872973] E [socket.c:2278:socket_connect_finish] 0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-09-28 12:57:38.490578] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2015-09-28 12:57:49.054475] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-09-28 13:01:12.062960] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3c65e07a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down
[2015-09-28 13:01:12.981945] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option *replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-09-28 13:01:13.009171] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-09-28 13:01:13.092483] I [graph.c:269:gf_add_cmdline_options] 0-gv0-replicate-0: adding option 'node-uuid' for volume 'gv0-replicate-0' with value '416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-09-28 13:01:13.100856] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2015-09-28 13:01:13.103995] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-0: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.114745] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-1: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.115725] I [rpc-clnt.c:1851:rpc_clnt_reconfig] 0-gv0-client-0: changing port to 49152 (from 0)
[2015-09-28 13:01:13.125619] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-2: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.132316] E [socket.c:2278:socket_connect_finish] 0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-09-28 13:01:13.132650] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-28 13:01:13.133322] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-09-28 13:01:13.133365] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-28 13:01:13.133782] I [MSGID: 108005] [afr-common.c:3998:afr_notify] 0-gv0-replicate-0: Subvolume 'gv0-client-0' came back up; going online.
[2015-09-28 13:01:13.133863] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server lk version = 1
Final graph:
+------------------------------------------------------------------------------+
  1: volume gv0-client-0
  2:     type protocol/client
  3:     option clnt-lk-version 1
  4:     option volfile-checksum 0
  5:     option volfile-key gluster/glustershd
  6:     option client-version 3.7.4
  7:     option process-uuid eapps-gluster01-65147-2015/09/28-13:01:12:970131-gv0-client-0-0-0
  8:     option fops-version 1298437
  9:     option ping-timeout 42
 10:     option remote-host eapps-gluster01.uwg.westga.edu
 11:     option remote-subvolume /export/sdb1/gv0
 12:     option transport-type socket
 13:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
 14:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
 15: end-volume
 16:
 17: volume gv0-client-1
 18:     type protocol/client
 19:     option ping-timeout 42
 20:     option remote-host eapps-gluster02.uwg.westga.edu
 21:     option remote-subvolume /export/sdb1/gv0
 22:     option transport-type socket
 23:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
 24:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
 25: end-volume
 26:
 27: volume gv0-client-2
 28:     type protocol/client
 29:     option ping-timeout 42
 30:     option remote-host eapps-gluster03.uwg.westga.edu
 31:     option remote-subvolume /export/sdb1/gv0
 32:     option transport-type socket
 33:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
 34:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
 35: end-volume
 36:
 37: volume gv0-replicate-0
 38:     type cluster/replicate
 39:     option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
 46:     subvolumes gv0-client-0 gv0-client-1 gv0-client-2
 47: end-volume
 48:
 49: volume glustershd
 50:     type debug/io-stats
 51:     subvolumes gv0-replicate-0
 52: end-volume
 53:
+------------------------------------------------------------------------------+
[2015-09-28 13:01:13.154898] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-09-28 13:01:13.155031] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-2: disconnected from gv0-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2015-09-28 13:01:13.155080] W [MSGID: 108001] [afr-common.c:4081:afr_notify] 0-gv0-replicate-0: Client-quorum is not met
[2015-09-29 08:11:24.728797] I [MSGID: 100011] [glusterfsd.c:1291:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2015-09-29 08:11:24.763338] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-09-29 12:50:41.915254] E [rpc-clnt.c:201:call_bail] 0-gv0-client-2: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0xd91f sent = 2015-09-29 12:20:36.092734. timeout = 1800 for 160.10.31.227:24007
[2015-09-29 12:50:41.923550] W [MSGID: 114032] [client-handshake.c:1623:client_dump_version_cbk] 0-gv0-client-2: received RPC status error [Transport endpoint is not connected]
[2015-09-30 23:54:36.547979] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2015-09-30 23:54:46.812870] E [socket.c:2278:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-10-01 00:14:20.997081] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-10-01 00:15:36.770579] W [socket.c:588:__socket_rwv] 0-gv0-client-2: readv on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-10-01 00:15:37.906708] E [socket.c:2278:socket_connect_finish] 0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-10-01 00:15:53.008130] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0() [0x3b91807a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down
[2015-10-01 00:15:53.008697] I [timer.c:48:gf_timer_call_after] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x3e2) [0x3b9480f992] -->/usr/lib64/libgfrpc.so.0(__save_frame+0x76) [0x3b9480f046] -->/usr/lib64/libglusterfs.so.0(gf_timer_call_after+0x1b1) [0x3b93447881] ) 0-timer: ctx cleanup started
[2015-10-01 00:15:53.994698] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option *replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-10-01 00:15:54.020401] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-10-01 00:15:54.086777] I [graph.c:269:gf_add_cmdline_options] 0-gv0-replicate-0: adding option 'node-uuid' for volume 'gv0-replicate-0' with value '416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-10-01 00:15:54.093004] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2015-10-01 00:15:54.098144] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-0: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.107432] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-1: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.115962] I [MSGID: 114020] [client.c:2118:notify] 0-gv0-client-2: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.120474] E [socket.c:2278:socket_connect_finish] 0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-10-01 00:15:54.120639] I [rpc-clnt.c:1851:rpc_clnt_reconfig] 0-gv0-client-0: changing port to 49152 (from 0)
Final graph:
+------------------------------------------------------------------------------+
  1: volume gv0-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host eapps-gluster01.uwg.westga.edu
  5:     option remote-subvolume /export/sdb1/gv0
  6:     option transport-type socket
  7:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
  8:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
  9: end-volume
 10:
 11: volume gv0-client-1
 12:     type protocol/client
 13:     option ping-timeout 42
 14:     option remote-host eapps-gluster02.uwg.westga.edu
 15:     option remote-subvolume /export/sdb1/gv0
 16:     option transport-type socket
 17:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
 18:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
 19: end-volume
 20:
 21: volume gv0-client-2
 22:     type protocol/client
 23:     option ping-timeout 42
 24:     option remote-host eapps-gluster03.uwg.westga.edu
 25:     option remote-subvolume /export/sdb1/gv0
 26:     option transport-type socket
 27:     option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
 28:     option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
 29: end-volume
 30:
 31: volume gv0-replicate-0
 32:     type cluster/replicate
 33:     option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
 40:     subvolumes gv0-client-0 gv0-client-1 gv0-client-2
 41: end-volume
 42:
 43: volume glustershd
 44:     type debug/io-stats
 45:     subvolumes gv0-replicate-0
 46: end-volume
 47:
+------------------------------------------------------------------------------+
[2015-10-01 00:15:54.135650] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 00:15:54.136223] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 00:15:54.136262] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-10-01 00:15:54.136410] I [MSGID: 108005] [afr-common.c:3998:afr_notify] 0-gv0-replicate-0: Subvolume 'gv0-client-0' came back up; going online.
[2015-10-01 00:15:54.136500] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server lk version = 1
[2015-10-01 00:15:54.401702] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-10-01 00:15:54.401834] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-2: disconnected from gv0-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-01 00:15:54.401878] W [MSGID: 108001] [afr-common.c:4081:afr_notify] 0-gv0-replicate-0: Client-quorum is not met
[2015-10-01 03:57:52.755426] E [socket.c:2278:socket_connect_finish] 0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-10-01 13:50:49.000708] E [socket.c:2278:socket_connect_finish] 0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection timed out)
[2015-10-01 14:36:40.481673] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-10-01 14:36:40.481833] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-1: disconnected from gv0-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-01 14:36:41.982037] I [rpc-clnt.c:1851:rpc_clnt_reconfig] 0-gv0-client-1: changing port to 49152 (from 0)
[2015-10-01 14:36:41.993478] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 14:36:41.994568] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-1: Connected to gv0-client-1, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 14:36:41.994647] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-10-01 14:36:41.994899] I [MSGID: 108002] [afr-common.c:4077:afr_notify] 0-gv0-replicate-0: Client-quorum is met
[2015-10-01 14:36:42.002275] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-1: Server lk version = 1




Thanks,
Gene Liverman
Systems Integration Architect
Information Technology Services
University of West Georgia

ITS: Making Technology Work for You!



On Wed, Sep 30, 2015 at 10:54 PM, Gaurav Garg <ggarg@xxxxxxxxxx> wrote:
Hi Gene,

Could you paste or attach the core file, the glusterd log file, and the cmd history so that we can find the actual root cause (RCA) of the crash? What steps did you perform before the crash?

>> How can I troubleshoot this?

If you want to troubleshoot this yourself, you can look into the glusterd log file and the core file.

Thank you..

Regards,
Gaurav

----- Original Message -----
From: "Gene Liverman" <gliverma@xxxxxxxxxx>
To: gluster-users@xxxxxxxxxxx
Sent: Thursday, October 1, 2015 7:59:47 AM
Subject: glusterd crashing

In the last few days I've started having issues with my glusterd service crashing. When it goes down it seems to do so on all nodes in my replicated volume. How can I troubleshoot this? I'm on a mix of CentOS 6 and RHEL 6. Thanks!



Gene Liverman
Systems Integration Architect
Information Technology Services
University of West Georgia
gliverma@xxxxxxxxxx


Sent from Outlook on my iPhone


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

Attachment: cmd_history.log
Description: Binary data
