Re: Geo-replication status is getting Faulty after few seconds

Aravinda <aravinda@xxxxxxxxxxx> · Wed, 31 Jan 2024 16:44:27 +0530

Hi Anant,

You need to run the gsec_create command when a new node is added to Primary or secondary

gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status

Or use the Geo-rep setup tool fix the key related issues and resetup (https://github.com/aravindavk/gluster-georep-tools)

gluster-georep-setup tier1data drtier1data::drtier1data

--
Aravinda
Kadalu Technologies

---- On Wed, 31 Jan 2024 02:49:08 +0530 Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx> wrote ---

Hi All,

As per the documentation, if we use `delete` only it will start the replication from
 the time where it was left before deleting the session, So I tried that without any luck.

gluster volume geo-replication tier1data drtier1data::drtier1data delete
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status

I have tried to check the drtier1data logs as well, and all I can see is master1 connects to drtier1data and
 send disconnect after 5 seconds, please check the following logs from drtier1data.

[2024-01-30 21:04:03.016805 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk]
 0-drtier1data-client-0: Connected, attached to remote volume [{conn-name=drtier1data-client-0}, {remote_subvol=/opt/tier1data2019/brick}] 
[2024-01-30 21:04:03.020148 +0000]
 I [fuse-bridge.c:5296:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.33
[2024-01-30 21:04:03.020197 +0000]
 I [fuse-bridge.c:5924:fuse_graph_sync] 0-fuse: switched to graph 0
[2024-01-30 21:04:08.573873 +0000]
 I [fuse-bridge.c:6233:fuse_thread_proc] 0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-c8c41k2k
[2024-01-30 21:04:08.575131 +0000]
 W [glusterfsd.c:1429:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x817a) [0x7fb907e2e17a] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55f97b17dbfd] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x55f97b17da48] ) 0-: received signum (15), shutting
 down 
[2024-01-30 21:04:08.575227 +0000]
 I [fuse-bridge.c:7063:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-c8c41k2k'.
[2024-01-30 21:04:08.575256 +0000]
 I [fuse-bridge.c:7068:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-c8c41k2k'.

Can anyone suggest how can I find the reason of getting these disconnect requests from master1 or what shall
 I check next?

Many thanks,
A

From: Gluster-users <gluster-users-bounces@xxxxxxxxxxx> on behalf of Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Sent: 30 January 2024 2:14 PM
 To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
EXTERNAL: Do not click links or open attachments if you do not recognize the sender.
Hello Everyone, 

I am looking for some help. Can anyone please suggest
 if it's possible to promote a master node to be the primary in the geo-replication session?

We have three master nodes and one secondary node.
 We are facing issues where geo-replication is consistently failing from the primary master node. We want to check if it works fine from another master node.

Any guidance or assistance would be highly appreciated.

Many thanks,
Anant

From: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Sent: 29 January 2024 3:55 PM
 To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
Hi  @Strahil Nikolov,

We have been running this geo-replication for more than 5 years and it was working
 fine till last week, So I think it shouldn't be something which was missed in the initial setup, but I am unable to understand why it's not working now.

I have enabled SSH Debug on the secondary node(drtier1data), and I can see this in
 the logs.

Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: server_input_channel_req:
 channel 0 request exec reply 1
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_by_channel: session
 0 channel 0
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_input_channel_req:
 session 0 req exec
Jan 29 14:25:52 drtier1data sshd[1268110]: Starting session: command for root
 from XX.236.28.58 port 53082 id 0
Jan 29 14:25:52 drtier1data sshd[1268095]: debug1: session_new: session 0
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: Received SIGCHLD.
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_by_pid: pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: session
 0 channel 0 pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: release
 channel 0
Jan 29 14:25:58 drtier1data sshd[1268110]: Received disconnect from XX.236.28.58
 port 53082:11: disconnected by user
Jan 29 14:25:58 drtier1data sshd[1268110]: Disconnected from user root XX.236.28.58
 port 53082
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: closing session
Jan 29 14:25:58 drtier1data sshd[1268095]: pam_unix(sshd:session): session closed
 for user root
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: deleting credentials

As per the above logs, drtier1data node is getting SIGCHLD from
 master1. (Received disconnect from XX.236.28.58 port 53082:11:
 disconnected by user)

Also, I have checked the gsyncd.log on master1, which says "SSH: SSH connection between
 master and slave established. [{duration=1.7277}]", which means passwordless ssh is working fine.

As per my understanding, Master1 can connect to the drtier1data server, and then the
 geo-replication status changes to Active --> History Crawl and then something happens on the master1 which triggers the SSH disconnect.

is it possible to change the master node in geo-replication so that we can mark master2
 as primary, instead of master1?

I am really struggling to fix this issue, Please help, any pointer is appreciated !!!

Many thanks,
Anant
From: Gluster-users <gluster-users-bounces@xxxxxxxxxxx> on behalf of Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Sent: 29 January 2024 12:20 AM
 To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
EXTERNAL: Do not click links or open attachments if you do not recognize the sender.
HI Strahil,

As mentioned in my last email, I have copied the gluster public key from master3 to secondary server, and I can now ssh from all master nodes to secondary server, but still getting the same error.

[root@master1 geo-replication]# ssh root@drtier1data -i /var/lib/glusterd/geo-replication/secret.pem
Last login: Mon Jan 29 00:14:32 2024 from 
[root@drtier1data ~]# 

[root@master2 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data
Last login: Mon Jan 29 00:02:34 2024 from 
[root@drtier1data ~]# 

[root@master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data 
Last login: Mon Jan 29 00:14:41 2024 from 
[root@drtier1data ~]#

Thanks,
Anant
From: Strahil Nikolov <hunter86_bg@xxxxxxxxx>
 Sent: 28 January 2024 10:07 PM
 To: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>; gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
EXTERNAL: Do
 not click links or open attachments if you do not recognize the sender.
Gluster doesn't use the ssh key in /root/.ssh, thus you need to exchange the public
 key that corresponds to /var/lib/glusterd/geo-replication/secret.pem . If you don't know the pub key, google how to obtain it from the private key.

Ensure that all hosts can ssh to the secondary before proceeding with the troubleshooting.

Best Regards,
Strahil Nikolov

On Sun, Jan 28, 2024 at 15:58, Anant Saraswat
<anant.saraswat@xxxxxxxxxxxxxx> wrote:
Hi All,

I have now copied  /var/lib/glusterd/geo-replication/secret.pem.pub  (public key) from master3 to drtier1data /root/.ssh/authorized_keys, and now I can ssh from master node3 to drtier1data using the georep key (/var/lib/glusterd/geo-replication/secret.pem).

But I am still getting the same error, and geo-replication is getting faulty again and
 again.

[2024-01-28 13:46:38.897683] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449598}]
[2024-01-28 13:46:38.922491] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:46:38.923127] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:46:38.923313] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449598},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:39.973584] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:46:40.98970] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:46:40.757691] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:40.766860] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:46:50.793311] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:46:50.793469] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:46:50.874474] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:46:52.659114] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7844}]
[2024-01-28 13:46:52.659461] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:46:53.698769] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0392}]
[2024-01-28 13:46:53.698984] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:46:55.831999] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:46:55.832354] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449615}]
[2024-01-28 13:46:55.854684] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:46:55.855251] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:46:55.855419] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449615},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:46:56.905496] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:46:57.38262] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:46:57.704128] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:46:57.706743] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:07.741438] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:07.741582] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:07.821284] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:09.573661] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7521}]
[2024-01-28 13:47:09.573955] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:10.612173] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0381}]
[2024-01-28 13:47:10.612359] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:12.751856] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:12.752237] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449632}]
[2024-01-28 13:47:12.759138] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:12.759690] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:12.759868] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449632},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:13.810321] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:13.924068] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:14.617663] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:14.620035] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:24.646013] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:24.646157] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:24.725510] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:26.491939] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7662}]
[2024-01-28 13:47:26.492235] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:27.530852] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0385}]
[2024-01-28 13:47:27.531036] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:29.670099] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:29.670640] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449649}]
[2024-01-28 13:47:29.696144] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:29.696709] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:29.696899] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449649},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:30.751127] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:30.885824] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:31.535252] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:31.538450] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:41.564276] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:41.564426] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:41.645110] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:47:43.435830] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7904}]
[2024-01-28 13:47:43.436285] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:47:44.475671] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0393}]
[2024-01-28 13:47:44.475865] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:47:46.630478] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:47:46.630924] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449666}]
[2024-01-28 13:47:46.655069] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:47:46.655752] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:47:46.655926] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449666},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:47:47.706875] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:47:47.834996] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:47:48.480822] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:47:48.491306] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:47:58.518263] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:47:58.518412] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:47:58.601096] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:00.355000] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7537}]
[2024-01-28 13:48:00.355345] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:48:01.395025] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0396}]
[2024-01-28 13:48:01.395212] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:48:03.541059] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:03.541481] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449683}]
[2024-01-28 13:48:03.567552] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:48:03.568172] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:48:03.568376] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449683},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:04.621488] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:48:04.742268] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:48:04.919335] I [master(worker /opt/tier1data2019/brick):2013:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=10}, {return_code=3}, {duration=0.0180}]
[2024-01-28 13:48:04.919919] E [syncdutils(worker /opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync -aR0 --inplace --files-from=- --super --stats
 --numeric-ids --no-implied-dirs --existing --xattrs --acls --ignore-missing-args . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-zo_ev6yu/75785990b3233f5dbbab9f43cc3ed895.sock
 drtier1data:/proc/799165/cwd}, {error=3}]
[2024-01-28 13:48:05.399226] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:05.403931] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:15.430175] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:15.430308] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:15.510770] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:17.240311] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7294}]
[2024-01-28 13:48:17.240509] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-28 13:48:18.279007] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0384}]
[2024-01-28 13:48:18.279195] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-28 13:48:20.455937] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-28 13:48:20.456274] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706449700}]
[2024-01-28 13:48:20.464288] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-28 13:48:20.464807] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-28 13:48:20.464970] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706449700},
 {entry_stime=(1705935991, 0)}]
[2024-01-28 13:48:21.514201] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-28 13:48:21.644609] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-28 13:48:22.284920] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-28 13:48:22.286189] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2024-01-28 13:48:32.312378] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2024-01-28 13:48:32.312526] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]
[2024-01-28 13:48:32.393484] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-28 13:48:34.91825] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.6981}]
[2024-01-28 13:48:34.92130] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...

Thanks,
Anant

From: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Sent: 28 January 2024 1:33 AM
 To: Strahil Nikolov <hunter86_bg@xxxxxxxxx>; gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
Hi  @Strahil Nikolov,

I have checked the ssh connection from all the master servers and I can ssh drtier1data from master1 and master2 server(old
 master servers), but I am unable to ssh drtier1data from
 master3 (new node).

[root@master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 325, in <module>
    main()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 259, in main
    if args.subcmd in ("worker"):
TypeError: 'in <string>' requires string as left operand, not NoneType
Connection to drtier1data closed.

But I am able to ssh  drtier1data from
 master3 without using the georep key.

[root@master3 ~]# ssh  root@drtier1data
Last login: Sun Jan 28 01:16:25 2024 from 87.246.74.32
[root@drtier1data ~]# 

Also, today I restarted the gluster server on master1 as geo-replication is trying to be active from master1 server, and sometimes I
 am getting the following error in gsyncd.log

[2024-01-28 01:27:24.722663] E [syncdutils(worker /opt/tier1data2019/brick):847:errlog] Popen: command returned error [{cmd=rsync
 -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --existing --xattrs --acls --ignore-missing-args . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto
 -S /tmp/gsyncd-aux-ssh-0exuoeg7/75785990b3233f5dbbab9f43cc3ed895.sock drtier1data:/proc/553418/cwd}, {error=3}]

Many thanks,
Anant
From: Strahil Nikolov <hunter86_bg@xxxxxxxxx>
 Sent: 27 January 2024 5:25 AM
 To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Subject: Re:  Geo-replication status is getting Faulty after few seconds  
EXTERNAL: Do not click links or open attachments if you do not recognize the sender.

 Don't forget to test with the georep key. I think it was /var/lib/glusterd/geo-replication/secret.pem

 Best Regards,
 Strahil Nikolov

 В събота, 27 януари 2024 г. в 07:24:07 ч. Гринуич+2, Strahil Nikolov <hunter86_bg@xxxxxxxxx> написа:

 Hi Anant,

 i would first start checking if you can do ssh from all masters to the slave node.If you haven't setup a dedicated user for the session, then gluster is using root.

 Best Regards,
 Strahil Nikolov

 В петък, 26 януари 2024 г. в 18:07:59 ч. Гринуич+2, Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx> написа:

 Hi All,

 I have run the following commands on master3, and that has added master3 to geo-replication.

 gluster system:: execute gsec_create

 gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force

 gluster volume geo-replication tier1data drtier1data::drtier1data stop

 gluster volume geo-replication tier1data drtier1data::drtier1data start

 Now I am able to start the geo-replication, but I am getting the same error.

 [2024-01-24 19:51:24.80892] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]

 [2024-01-24 19:51:24.81020] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/opt/tier1data2019/brick}, {slave_node=drtier1data}]

 [2024-01-24 19:51:24.158021] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...

 [2024-01-24 19:51:25.951998] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.7938}]

 [2024-01-24 19:51:25.952292] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...

 [2024-01-24 19:51:26.986974] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0346}]

 [2024-01-24 19:51:26.987137] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor

 [2024-01-24 19:51:29.139131] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]

 [2024-01-24 19:51:29.139531] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706125889}]

 [2024-01-24 19:51:29.173877] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]

 [2024-01-24 19:51:29.174407] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]

 [2024-01-24 19:51:29.174558] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706125889}, {entry_stime=(1705935991, 0)}]

 [2024-01-24 19:51:30.251965] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]

 [2024-01-24 19:51:30.376715] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]

 [2024-01-24 19:51:30.991856] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]

 [2024-01-24 19:51:30.993608] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]

 Any idea why it's stuck in this loop?

 Thanks,

 Anant

 ________________________________
 From: Gluster-users <gluster-users-bounces@xxxxxxxxxxx> on behalf of Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
 Sent: 22 January 2024 9:00 PM
 To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>
 Subject: [Gluster-users] Geo-replication status is getting Faulty after few seconds

 EXTERNAL: Do not click links or open attachments if you do not recognize the sender.

 Hi There,

 We have a Gluster setup with three master nodes in replicated mode and one slave node with geo-replication.

 # gluster volume info

 Volume Name: tier1data

 Type: Replicate

 Volume ID: 93c45c14-f700-4d50-962b-7653be471e27

 Status: Started

 Snapshot Count: 0

 Number of Bricks: 1 x 3 = 3

 Transport-type: tcp

 Bricks:

 Brick1: master1:/opt/tier1data2019/brick

 Brick2: master2:/opt/tier1data2019/brick

 Brick3: master3:/opt/tier1data2019/brick

 master1 |master2 |  ------------------------------geo-replication----------------------------- | drtier1datamaster3 |

 We added the master3 node a few months back, the initial setup consisted of 2 master nodes and one geo-replicated slave(drtier1data).

 Our geo-replication was functioning well with the initial two master nodes (master1 and master2), where master1 was active and master2 was in passive mode. However, today, we started experiencing issues where geo-replication suddenly stopped and became stuck
 in a loop of Initializing..., Active.. Faulty on master1, while master2 remained in passive mode.

 Upon checking the gsyncd.log on the master1 node, we observed the following error (please refer to the attached logs for more details):

 E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]

 # gluster volume geo-replication tier1data status

 MASTER NODE            MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                                            SLAVE NODE    STATUS             CRAWL STATUS    LAST_SYNCED

 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 master1    tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data    N/A           Faulty           N/A             N/A

 master2    tier1data     /opt/tier1data2019/brick    root          ssh://drtier1data::drtier1data                  Passive            N/A             N/A

 Suspecting an issue on the drtier1data(slave), I attempted to restart Gluster on the slave node, also tried to restart drtier1data server without any luck.

 After that I tried the following command to get the Primary-log-file for geo-replication on master1, and got the following error.

 # gluster volume geo-replication tier1data drtier1data::drtier1data config log-file

 Staging failed on master3. Error: Geo-replication session between tier1data and drtier1data::drtier1data does not exist.

 geo-replication command failed

 Master3 was the new node added a few months back, but geo-replication was working until today, and we never added this node under geo-replication.

 After that, I forcefully stopped the geo-replication, thinking that restarting geo-replication might fix the issue. However, now the geo-replication is not starting and is giving the same error.

 # gluster volume geo-replication tier1data drtier1data::drtier1data start force

 Staging failed on master3. Error: Geo-replication session between tier1data and drtier1data::drtier1data does not exist.

 geo-replication command failed

 Can anyone please suggest what I should do next to resolve this issue? As there is 5TB of data in this volume, I don't want to resync the entire data to drtier1data. Instead, I want to resume the sync from where it last stopped.

 Thanks in advance for any guidance/help.

 Kind regards,

 Anant

 DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential
 information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this
 email from your system.

 If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation.

 DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential
 information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this
 email from your system.

 If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation.

 ________

 Community Meeting Calendar:

 Schedule -
 Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
 Bridge:  https://urldefense.com/v3/__https://meet.google.com/cpu-eiue-hvk__;!!I_DbfM1H!FIFMVBFvoomIXp1pMhjtLbD-1B_qztpAUPBHP5MST7a1hcf3FP8o6GkbQwzQUnS2nT_YIQ1MF7GV_PtM0CAQoOCo4VSgaw$
 Gluster-users mailing list
 Gluster-users@xxxxxxxxxxx
 https://urldefense.com/v3/__https://lists.gluster.org/mailman/listinfo/gluster-users__;!!I_DbfM1H!FIFMVBFvoomIXp1pMhjtLbD-1B_qztpAUPBHP5MST7a1hcf3FP8o6GkbQwzQUnS2nT_YIQ1MF7GV_PtM0CAQoOBrmvmlMg$
DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential
 information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this
 email from your system.

 If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation.
DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender. This message contains confidential
 information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this
 email from your system.

 If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation.

DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please
 notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if
 you have received this email by mistake and delete this email from your system. 

 If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Thanks for your cooperation.
________

Community Meeting Calendar: 

Schedule - 
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC 
Bridge: https://meet.google.com/cpu-eiue-hvk 
Gluster-users mailing list 
Gluster-users@xxxxxxxxxxx 
https://lists.gluster.org/mailman/listinfo/gluster-users 

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users