Hi Group,

I'm having some difficulty debugging a problem (or maybe more than one problem).

I have a replica 3 gluster file system. The basic setup is three machines, all CentOS 7 (up to date), gluster 3.8 (provided by the CentOS repos), with XFS brick partitions.

At a (so far random) point in time the gluster server daemons stop communicating with each other. The glusterfs processes for each volume on each server are still running, though. If I restart them manually (systemctl restart glusterfsd glusterd), communication is established again.

I can't find any obvious log message that would point me to the source of the problem. In general, all servers show a lot of different warnings and errors in the brick logs almost the whole time clients are connected. All clients are Fedora 25 machines using the FUSE gluster client (3.8).

From the brick log on the storage servers:

It seems the quota subsystem has some problems:

[2017-05-22 11:41:36.323233] W [marker-quota.c:33:mq_loc_copy] 0-marker: src loc is not valid
[2017-05-22 11:41:36.323269] E [marker-quota.c:1472:mq_initiate_quota_task] 0-home-marker: loc copy failed

And there are a lot of warnings about xattrs like this:

[2017-05-22 11:32:28.111240] W [MSGID: 113001] [posix.c:4212:posix_get_ancestry_non_directory] 0-home-posix: listxattr failed on/srv/gluster_home/brick/.glusterfs/34/06/3406a026-596b-40b3-8dd1-8186e3072031 [No such file or directory]

And I get some of these:

[2017-05-22 11:41:29.591258] E [MSGID: 115081] [server-rpc-fops.c:1201:server_fstat_cbk] 0-home-server: 923966: FSTAT -2 (197f90db-ff62-4436-b200-f4347b6c2ba0) ==> (Operation not permitted) [Operation not permitted]

There are some warnings in glustershd.log, but none of those logged before the disconnect give me a clue as to what's going on:

[2017-05-22 07:37:48.003835] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-home-client-1: remote operation failed. Path: <gfid:b8812b00-7bda-4871-ab39-82d1eb8a2607> (b8812b00-7bda-4871-ab39-82d1eb8a2607) [No such file or directory]
[2017-05-22 07:37:48.003842] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-home-client-0: remote operation failed. Path: <gfid:b8812b00-7bda-4871-ab39-82d1eb8a2607> (b8812b00-7bda-4871-ab39-82d1eb8a2607) [No such file or directory]
[2017-05-22 07:37:48.003896] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-home-client-2: remote operation failed. Path: <gfid:b8812b00-7bda-4871-ab39-82d1eb8a2607> (b8812b00-7bda-4871-ab39-82d1eb8a2607) [No such file or directory]
[2017-05-22 11:14:36.832051] W [socket.c:590:__socket_rwv] 0-home-client-1: readv on X.X.X.8:49156 failed (No data available)
[2017-05-22 11:14:36.832103] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-home-client-1: disconnected from home-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-22 11:14:36.832295] W [socket.c:590:__socket_rwv] 0-home-client-2: readv on X.X.X.9:49156 failed (No data available)
[2017-05-22 11:14:36.832317] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-home-client-2: disconnected from home-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-22 11:14:36.832328] W [MSGID: 108001] [afr-common.c:4467:afr_notify] 0-home-replicate-0: Client-quorum is not met
[2017-05-22 11:14:47.423671] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-home-client-1: changing port to 49156 (from 0)
[2017-05-22 11:14:47.430133] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-home-client-2: changing port to 49156 (from 0)
[2017-05-22 11:14:47.434348] E [socket.c:2309:socket_connect_finish] 0-home-client-1: connection to X.X.X.8:49156 failed (Connection refused)
[2017-05-22 11:14:47.438925] E [socket.c:2309:socket_connect_finish] 0-home-client-2: connection to X.X.X.9:49156 failed (Connection refused)

Gluster Volume Info:

Volume Name: home
Type: Replicate
Volume ID: 47875706-1ed1-48b7-bc5d-0dca8ec5cd58
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: four:/srv/gluster_home/brick
Brick2: five:/srv/gluster_home/brick
Brick3: six:/srv/gluster_home/brick
Options Reconfigured:
server.root-squash: off
network.ping-timeout: 10
cluster.quorum-type: auto
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on

Installed gluster server (CentOS 7) version:
glusterfs-server-3.8.11-1.el7.x86_64

Installed gluster client (Fedora 25) version:
glusterfs-3.8.9-1.fc24.x86_64

Does anybody have an idea why the communication might be interrupted? And also why there are so many warnings about extended attributes of non-existent files?

I'm running basically the same setup with CentOS clients for oVirt without any gluster warnings or errors at all.

Currently I'm at a loss. Any help is highly appreciated!
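
To get an overview of which warnings and errors dominate a log, the lines can be tallied by severity and source function. The following is only a minimal sketch in Python, assuming the log line layout seen in the excerpts quoted in this mail. The regular expression, the `tally` helper and the `SAMPLE` list are illustrative names, not part of gluster, and the pattern may well miss line formats that do not appear here.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts in this mail:
# [timestamp] LEVEL [MSGID: n] [file.c:line:function] xlator: message
# The "[MSGID: n]" part is optional.
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^:\]]+):(?P<lineno>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): (?P<msg>.*)$"
)


def tally(lines):
    """Count warning (W) and error (E) lines per (level, function)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not fit the assumed layout are silently skipped.
        if match and match.group("level") in ("W", "E"):
            counts[(match.group("level"), match.group("func"))] += 1
    return counts


# A few lines copied from the logs above, used as sample input.
SAMPLE = [
    "[2017-05-22 11:41:36.323233] W [marker-quota.c:33:mq_loc_copy] "
    "0-marker: src loc is not valid",
    "[2017-05-22 11:41:36.323269] E [marker-quota.c:1472:mq_initiate_quota_task] "
    "0-home-marker: loc copy failed",
    "[2017-05-22 11:14:36.832051] W [socket.c:590:__socket_rwv] "
    "0-home-client-1: readv on X.X.X.8:49156 failed (No data available)",
    "[2017-05-22 11:14:36.832295] W [socket.c:590:__socket_rwv] "
    "0-home-client-2: readv on X.X.X.9:49156 failed (No data available)",
    "[2017-05-22 11:14:47.423671] I [rpc-clnt.c:1965:rpc_clnt_reconfig] "
    "0-home-client-1: changing port to 49156 (from 0)",
]

if __name__ == "__main__":
    for (level, func), count in tally(SAMPLE).most_common():
        print(level, func, count)
```

To run it against a real log, the lines of an open file can be passed to `tally` instead of `SAMPLE`.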

Thanks,
Richard Neuboeck

--
---------------------------------------------------------------------
[a] Department for Theoretical Chemistry
    University of Vienna
    Waehringer Strasse 17/3/304, 1090 Wien, Austria
[p] +43 1 4277 52735
[m] hawk@xxxxxxxxxxxxxxxx
---------------------------------------------------------------------