Hi all,

our systems have suffered a node failure in a replica 3 setup. The node needed a complete reinstall. I followed the RH guide to replace a host with the same hostname (https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts). The new machine runs the same OS (CentOS 7), but it got slightly newer gluster packages (glusterfs-3.12.6-1.el7.x86_64) than the other nodes (glusterfs-3.12.5-2.el7.x86_64).

As the guide instructs, I created /var/lib/glusterd/glusterd.info with the UUID of the old node. Then I copied the /var/lib/glusterd/peers/<uuid> files from the two other nodes to the new one (except the uuid file of the old host itself). I created all the brick directories as they exist on the other machines (empty, of course) and set the volume-id extended attribute on each to the value retrieved from the running nodes. On one of the old nodes I mounted each export, created and removed a directory, and set and removed an extended attribute, as the guide suggests, to trigger self-healing.

After that I started the gluster daemon (systemctl start glusterd glusterfsd). The new node lists the other peers as connected (and vice versa), but no brick processes are started. So the replacement bricks are not in use and no healing is done.

I checked the logs and searched online but couldn't find a reason why the brick processes are not running or how to get them running.

Is there a way to get the brick processes started? (Preferably without shutting down the other nodes, since they are in use.)
Does anyone have a different approach to replacing a faulty node?

Thanks in advance!

Cheers
Richard

Here is the glusterd.log. I've seen the disconnect messages but found no reason for them.
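Before the log, to make the volume-id step above concrete: the extended attribute on a brick directory is normally `trusted.glusterfs.volume-id`, and its value is the volume UUID written as hex without dashes. The following is only a minimal sketch of how the `setfattr` call for one brick could be built. The helper names and the placeholder UUID are made up for illustration; the real value has to be read from a healthy node.

```python
import uuid
from typing import List

# Name of the xattr that marks a directory as a brick of a given volume.
XATTR_NAME = "trusted.glusterfs.volume-id"


def volume_id_xattr_value(volume_id: str) -> str:
    """Return the volume UUID as a 0x-prefixed hex string without dashes."""
    # uuid.UUID validates the input and .hex drops the dashes.
    return "0x" + uuid.UUID(volume_id).hex


def setfattr_command(brick_path: str, volume_id: str) -> List[str]:
    """Build (but do not run) the setfattr command for one brick directory."""
    return [
        "setfattr",
        "-n", XATTR_NAME,
        "-v", volume_id_xattr_value(volume_id),
        brick_path,
    ]


if __name__ == "__main__":
    # Placeholder UUID for illustration only, not taken from the cluster.
    placeholder = "12345678-1234-5678-1234-567812345678"
    print(" ".join(setfattr_command("/srv/gluster_engine/brick", placeholder)))
```

The sketch only prints the command, so it can be compared against what was actually set on the new bricks before anything is changed.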
/var/log/glusterd.log

[2018-03-20 13:34:01.333423] I [MSGID: 100030] [glusterfsd.c:2524:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.12.6 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-03-20 13:34:01.339203] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-03-20 13:34:01.339243] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2018-03-20 13:34:01.339256] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-03-20 13:34:01.343809] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.12.6/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2018-03-20 13:34:01.343836] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2018-03-20 13:34:01.343847] W [rpcsvc.c:1682:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-03-20 13:34:01.343855] E [MSGID: 106243] [glusterd.c:1769:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-03-20 13:34:01.344594] I [MSGID: 106228] [glusterd.c:499:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2018-03-20 13:34:01.344936] I [MSGID: 106513] [glusterd-store.c:2241:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31202
[2018-03-20 13:34:01.471227] I [MSGID: 106498] [glusterd-handler.c:3603:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2018-03-20 13:34:01.471297] I [MSGID: 106498] [glusterd-handler.c:3603:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2018-03-20 13:34:01.471325] W [MSGID: 106062] [glusterd-handler.c:3400:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2018-03-20 13:34:01.471351] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:01.471412] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-management: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-03-20 13:34:01.474137] W [MSGID: 106062] [glusterd-handler.c:3400:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2018-03-20 13:34:01.474161] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:01.474238] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-management: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-03-20 13:34:01.476646] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: e4ed3102-9794-494b-af36-d767d8a72678
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option transport.listen-backlog 10
  7:     option rpc-auth-allow-insecure on
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:
+------------------------------------------------------------------------------+
[2018-03-20 13:34:01.476895] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-03-20 13:34:12.197917] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c, host: borg-sphere-three, port: 0
[2018-03-20 13:34:12.198929] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume engine. Starting local bricks.
[2018-03-20 13:34:12.199166] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_engine/brick
[2018-03-20 13:34:12.202498] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:12.208389] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume export. Starting local bricks.
[2018-03-20 13:34:12.208622] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_export/brick
[2018-03-20 13:34:12.211426] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:12.216722] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume iso. Starting local bricks.
[2018-03-20 13:34:12.216906] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_iso/brick
[2018-03-20 13:34:12.219439] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:12.224400] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume plexus. Starting local bricks.
[2018-03-20 13:34:12.224555] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_plexus/brick
[2018-03-20 13:34:12.226902] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:12.231689] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2018-03-20 13:34:12.231986] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-03-20 13:34:12.232047] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-03-20 13:34:12.232082] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-03-20 13:34:12.232165] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2018-03-20 13:34:12.238970] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 3554
[2018-03-20 13:34:13.239224] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-03-20 13:34:13.239365] I [MSGID: 106567] [glusterd-svc-mgmt.c:197:glusterd_svc_start] 0-management: Starting glustershd service
[2018-03-20 13:34:14.243040] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2018-03-20 13:34:14.243817] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-03-20 13:34:14.243866] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-03-20 13:34:14.243928] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2018-03-20 13:34:14.244474] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-03-20 13:34:14.244514] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-03-20 13:34:14.244589] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2018-03-20 13:34:14.245123] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2018-03-20 13:34:14.245169] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-03-20 13:34:14.260266] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_navaar/brick
[2018-03-20 13:34:14.263172] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-03-20 13:34:14.271938] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-03-20 13:34:14.272146] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-03-20 13:34:14.272366] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-03-20 13:34:14.272562] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2018-03-20 13:34:14.273000] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c
[2018-03-20 13:34:14.273770] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-03-20 13:34:14.273907] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c
[2018-03-20 13:34:14.277313] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-03-20 13:34:14.280409] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_engine/brick has disconnected from glusterd.
[2018-03-20 13:34:14.283608] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-03-20 13:34:14.286608] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_export/brick has disconnected from glusterd.
[2018-03-20 13:34:14.289765] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-03-20 13:34:14.292523] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_iso/brick has disconnected from glusterd.
[2018-03-20 13:34:14.295494] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-03-20 13:34:14.298261] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_plexus/brick has disconnected from glusterd.
[2018-03-20 13:34:14.298421] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 0e8b912a-bcff-4b33-88c6-428b3e658440, host: borg-sphere-one, port: 0
[2018-03-20 13:34:14.298935] I [glusterd-utils.c:5847:glusterd_brick_start] 0-management: discovered already-running brick /srv/gluster_engine/brick
[2018-03-20 13:34:14.298958] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /srv/gluster_engine/brick on port 49152
[2018-03-20 13:34:14.299037] I [glusterd-utils.c:5847:glusterd_brick_start] 0-management: discovered already-running brick /srv/gluster_export/brick
[2018-03-20 13:34:14.299051] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /srv/gluster_export/brick on port 49153
[2018-03-20 13:34:14.299117] I [glusterd-utils.c:5847:glusterd_brick_start] 0-management: discovered already-running brick /srv/gluster_iso/brick
[2018-03-20 13:34:14.299130] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /srv/gluster_iso/brick on port 49154
[2018-03-20 13:34:14.299208] I [glusterd-utils.c:5847:glusterd_brick_start] 0-management: discovered already-running brick /srv/gluster_plexus/brick
[2018-03-20 13:34:14.299223] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /srv/gluster_plexus/brick on port 49155
[2018-03-20 13:34:14.299292] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-03-20 13:34:14.299344] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-03-20 13:34:14.299365] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-03-20 13:34:14.302501] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 3896
[2018-03-20 13:34:15.302703] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-03-20 13:34:15.302798] I [MSGID: 106567] [glusterd-svc-mgmt.c:197:glusterd_svc_start] 0-management: Starting glustershd service
[2018-03-20 13:34:15.305136] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-03-20 13:34:15.305172] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-03-20 13:34:15.305384] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-03-20 13:34:15.305406] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is stopped
[2018-03-20 13:34:15.305599] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2018-03-20 13:34:15.305618] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
[2018-03-20 13:34:15.323512] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-03-20 13:34:15.326856] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_navaar/brick has disconnected from glusterd.
[2018-03-20 13:34:15.329968] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 0e8b912a-bcff-4b33-88c6-428b3e658440
[2018-03-20 13:34:15.330024] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31202
[2018-03-20 13:34:15.335968] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c
[2018-03-20 13:34:15.336908] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to borg-sphere-three (0), ret: 0, op_ret: 0
[2018-03-20 13:34:15.340577] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /srv/gluster_engine/brick on port 49152
[2018-03-20 13:34:15.340669] E [socket.c:2369:socket_connect_finish] 0-management: connection to /var/run/gluster/7c88a1ced3d7819183c1b75562132753.socket failed (Connection reset by peer); disconnecting socket
[2018-03-20 13:34:15.343472] E [socket.c:2369:socket_connect_finish] 0-management: connection to /var/run/gluster/92f05640572fdb863e0d3655821a9221.socket failed (Connection reset by peer); disconnecting socket
[2018-03-20 13:34:15.346173] E [socket.c:2369:socket_connect_finish] 0-management: connection to /var/run/gluster/855c85c59ce6144e0cdaadc081dab574.socket failed (Connection reset by peer); disconnecting socket
[2018-03-20 13:34:15.351476] W [socket.c:593:__socket_rwv] 0-management: readv on /var/run/gluster/2ac0088f40227ca69fb39d3c98e51d2d.socket failed (No data available)
[2018-03-20 13:34:15.354084] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_plexus/brick has disconnected from glusterd.
[2018-03-20 13:34:15.354184] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /srv/gluster_plexus/brick on port 49155
[2018-03-20 13:34:15.354222] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c
[2018-03-20 13:34:15.354597] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-03-20 13:34:15.354645] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 0acd0bff-c38f-4c49-82da-4112d22dfd2c
[2018-03-20 13:34:15.354670] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /srv/gluster_export/brick on port 49153
[2018-03-20 13:34:15.354789] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /srv/gluster_iso/brick on port 49154
[2018-03-20 13:34:15.354905] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0e8b912a-bcff-4b33-88c6-428b3e658440
[2018-03-20 13:34:15.354927] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-03-20 13:34:15.355536] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31202
[2018-03-20 13:34:15.359667] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 0e8b912a-bcff-4b33-88c6-428b3e658440
[2018-03-20 13:34:15.360277] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to borg-sphere-one (0), ret: 0, op_ret: 0
[2018-03-20 13:34:15.361113] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0e8b912a-bcff-4b33-88c6-428b3e658440
[2018-03-20 13:34:15.361151] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend

--
/dev/null
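Since the start and disconnect messages are spread through the log above, a small filter can help line them up per brick. This is only a rough sketch, assuming the log file on disk has one entry per line and that the three message texts matched below are the ones of interest. The function name `brick_events` and the event labels are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# (label, pattern) pairs; group 1 of each pattern is the brick path.
PATTERNS = [
    ("start", re.compile(r"starting a fresh brick process for brick (\S+)")),
    ("already-running", re.compile(r"discovered already-running brick (\S+)")),
    ("disconnect", re.compile(r"Brick [^:\s]+:(\S+) has disconnected from glusterd")),
]


def brick_events(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect brick-related events per brick path, in log order."""
    events: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events[match.group(1)].append(label)
                break  # at most one event per log entry
    return dict(events)


if __name__ == "__main__":
    # Three entries copied from the log above.
    sample = [
        "[2018-03-20 13:34:12.199166] I [glusterd-utils.c:5941:glusterd_brick_start] 0-management: starting a fresh brick process for brick /srv/gluster_engine/brick",
        "[2018-03-20 13:34:14.280409] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick borg-sphere-two:/srv/gluster_engine/brick has disconnected from glusterd.",
        "[2018-03-20 13:34:14.298935] I [glusterd-utils.c:5847:glusterd_brick_start] 0-management: discovered already-running brick /srv/gluster_engine/brick",
    ]
    for brick, seen in brick_events(sample).items():
        print(brick, "->", ", ".join(seen))
```

To run it against the real file, pass an open file object instead of the sample list, for example `brick_events(open("/var/log/glusterd.log"))` with the path adjusted to wherever the log actually lives.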