Hi. I am new to GlusterFS. I ran a failover evaluation and hit a "Transport endpoint is not connected" error. Can someone explain what state this indicates and how to recover? Details follow.

The Gluster version is 4.1.1. There are 9 nodes (machines), each running CentOS 7.5 and hosting one brick. The volume type is Distributed-Replicate with replica 3 (one of every three bricks is an arbiter). The volume info is as follows:

# gluster volume info

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 2c450ca8-d385-43a3-8761-7d227ee61d37
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: host01:/glusterfs/vol0/brick0
Brick2: host02:/glusterfs/vol0/brick0
Brick3: host03:/glusterfs/vol0/brick0 (arbiter)
Brick4: host04:/glusterfs/vol0/brick0
Brick5: host05:/glusterfs/vol0/brick0
Brick6: host06:/glusterfs/vol0/brick0 (arbiter)
Brick7: host07:/glusterfs/vol0/brick0
Brick8: host08:/glusterfs/vol0/brick0
Brick9: host09:/glusterfs/vol0/brick0 (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

The evaluation consists of the following two simultaneous (independent) operations:

(1) On a gluster native client, which is not one of the gluster server machines, files from a local filesystem are copied into a directory under the gluster mount point, the copied files are read back, and then they are deleted. This copy-read-delete cycle is repeated continuously. There are 1000 files, each 10 MB in size. (A rough sketch of this workload is attached at the end of this mail.)

(2) The gluster server machines are stopped and restarted one after another, with a stop-start period of about 3 minutes. At any given time no more than one server is down. The restart targets are selected sequentially (host01 -> host02 -> ... -> host09 -> host01 -> ...).

The test ran for a long time; after more than 48 hours the client started writing messages containing "Transport endpoint is not connected" to stderr. At the same time, the following entries were logged in /var/log/glusterfs/<mountpoint>.log on the client:

[2018-08-20 01:37:58.372075] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file990>, aca9c945-8a3b-4ffe-af45-d07c4cea355a on vol0-client-2 and 8518500f-fb45-4466-9878-10a25545aa48 on vol0-client-0.
[2018-08-20 01:37:58.372234] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.
[2018-08-20 01:37:58.381751] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file998>, 0b401c35-70d7-4255-a4ca-71b27f3fd7e7 on vol0-client-2 and 45a14259-9d4f-43e3-8c55-4c6dc080a97c on vol0-client-0.
[2018-08-20 01:37:58.381931] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.
... (this pattern repeats many times, with different GFIDs)

The only way I have found to recover from this error is to remove all of the associated files under /glusterfs/vol0/brick0 on the bricks.

I would like to understand what state the volume ends up in, how to recover from it, and how to prevent it from happening in the first place. Thanks.

P.S. As far as we have tested, there is no memory leak in this version.
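
P.P.S. For reference, here is a minimal sketch of the client-side workload in (1). The paths /data/src (local source files) and /mnt/vol0/test (directory under the gluster FUSE mount) are placeholders, not the actual paths used in the test:

#!/usr/bin/env python3
# Sketch of the client-side copy-read-delete loop described in (1).
# Placeholder paths: adjust to the real source directory and gluster mount.
import os
import shutil

SRC_DIR = "/data/src"       # local directory holding the 1000 x 10 MB source files
DST_DIR = "/mnt/vol0/test"  # directory under the glusterfs native (FUSE) mount

def one_cycle():
    os.makedirs(DST_DIR, exist_ok=True)
    names = sorted(os.listdir(SRC_DIR))
    # copy: local filesystem -> gluster volume
    for name in names:
        shutil.copy(os.path.join(SRC_DIR, name), os.path.join(DST_DIR, name))
    # read back every copied file
    for name in names:
        with open(os.path.join(DST_DIR, name), "rb") as f:
            while f.read(1 << 20):
                pass
    # delete the copied files
    for name in names:
        os.remove(os.path.join(DST_DIR, name))

if __name__ == "__main__":
    while True:  # repeat the cycle continuously
        one_cycle()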