Hi @Aravinda,
As advised, I have removed all the master-server-related entries from the ".ssh/authorized_keys" file on the secondary node. Then I ran the "ssh-copy-id root@drtier1data" command on all the master nodes to set up passwordless SSH and verified that I can access the drtier1data server from each of them.
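(For reference, this is roughly the check I used from each master node; the geo-rep key path is the standard one referenced elsewhere in this thread:)

# should log in without a password prompt (key installed by ssh-copy-id)
ssh root@drtier1data hostname
# geo-replication uses its own key, so I checked that one as well
ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data hostname

After that, I ran the following commands.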
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
[root@master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS             CRAWL STATUS    LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Initializing...    N/A             N/A
[root@master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS     LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Active     History Crawl    N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A              N/A
[root@master3 ~]# gluster volume geo-replication tier1data drtier1data::drtier1data status

MASTER NODE    MASTER VOL    MASTER BRICK                SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
master3        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A             N/A
master1        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data    N/A           Faulty     N/A             N/A
master2        tier1data     /opt/tier1data2019/brick    root          drtier1data::drtier1data                  Passive    N/A             N/A
The same thing is still happening. One thing I have noticed is that master1 is always the node that tries to become the active node; how does geo-replication decide which node becomes the active (primary) node?

Second, I have noticed that in the ".ssh/authorized_keys" file on the secondary node, master2 and master3 each have two entries (one with command="/usr/libexec/glusterfs/gsyncd" and one with command="tar ${SSH_ORIGINAL_COMMAND#* }"), but master1 has only one entry, with command="tar ${SSH_ORIGINAL_COMMAND#* }". This indicates that the command="/usr/libexec/glusterfs/gsyncd" entry is missing for master1. Does that make sense?
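(This is roughly how I compared the entries on the secondary; the path is the standard root authorized_keys file, adjust if yours differs:)

# on drtier1data: count how many of each command= entry exist
grep -o 'command="[^"]*"' /root/.ssh/authorized_keys | sort | uniq -c
# and check which entries carry which master's key comment (assuming the comments include the hostnames)
grep -E 'master1|master2|master3' /root/.ssh/authorized_keys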
Really appreciate your help on this.
Thanks,
Anant
From: Aravinda <aravinda@xxxxxxxxxxx>
Sent: 31 January 2024 5:27 PM
To: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds

In one of the threads, I saw that the SSH public key was copied manually to the secondary node's authorized_keys file. Check whether that entry starts with "command=" or not. If it does not, delete that entry (or delete all geo-rep related entries in that file) and run the geo-rep create push-pem force command again.
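Something along these lines, assuming the root user's authorized_keys on the secondary and the session names from this thread (take a backup first):

# on the secondary: back up the file and remove the geo-rep related entries
cp /root/.ssh/authorized_keys /root/.ssh/authorized_keys.bak
sed -i '/gsyncd/d;/SSH_ORIGINAL_COMMAND/d' /root/.ssh/authorized_keys

# on one of the primary nodes: redistribute the keys and recreate the session
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force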
--
Aravinda
Kadalu Technologies
---- On Wed, 31 Jan 2024 17:19:45 +0530 Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx> wrote ---
Hi @Aravinda,
I used the exact same commands when I added master3 to the primary geo-replication. However, the issue is that the master3 node is in a passive state, and master1 is stuck in a loop of Initializing -> Active -> Faulty. It never considers master2 or master3 as the primary master for geo-replication.
If master1 can connect to the secondary (drtier1data) server, and the master1 logs show "SSH connection between master and slave established.", do you still think this could be related to key issues? I am willing to rerun the commands from master1 if you advise.

[2024-01-30 23:33:14.274611] I [resource(worker /opt/tier1data2019/brick):1387:connect_remote] SSH: Initializing SSH connection between master and slave...
[2024-01-30 23:33:15.960004] I [resource(worker /opt/tier1data2019/brick):1436:connect_remote] SSH: SSH connection between master and slave established. [{duration=1.6852}]
[2024-01-30 23:33:15.960300] I [resource(worker /opt/tier1data2019/brick):1116:connect] GLUSTER: Mounting gluster volume locally...
[2024-01-30 23:33:16.995715] I [resource(worker /opt/tier1data2019/brick):1139:connect] GLUSTER: Mounted gluster volume [{duration=1.0353}]
[2024-01-30 23:33:16.995905] I [subcmds(worker /opt/tier1data2019/brick):84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to monitor
[2024-01-30 23:33:19.154376] I [master(worker /opt/tier1data2019/brick):1662:register] _GMaster: Working dir [{path=/var/lib/misc/gluster/gsyncd/tier1data_drtier1data_drtier1data/opt-tier1data2019-brick}]
[2024-01-30 23:33:19.154759] I [resource(worker /opt/tier1data2019/brick):1292:service_loop] GLUSTER: Register time [{time=1706657599}]
[2024-01-30 23:33:19.191343] I [gsyncdstatus(worker /opt/tier1data2019/brick):281:set_active] GeorepStatus: Worker Status Change [{status=Active}]
[2024-01-30 23:33:19.191940] I [gsyncdstatus(worker /opt/tier1data2019/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change [{status=History Crawl}]
[2024-01-30 23:33:19.192105] I [master(worker /opt/tier1data2019/brick):1576:crawl] _GMaster: starting history crawl [{turns=1}, {stime=(1705935991, 0)}, {etime=1706657599}, {entry_stime=(1705935991, 0)}]
[2024-01-30 23:33:20.269529] I [master(worker /opt/tier1data2019/brick):1605:crawl] _GMaster: slave's time [{stime=(1705935991, 0)}]
[2024-01-30 23:33:20.385018] E [syncdutils(worker /opt/tier1data2019/brick):346:log_raise_exception] <top>: Gluster Mount process exited [{error=ENOTCONN}]
[2024-01-30 23:33:21.674] I [monitor(monitor):228:monitor] Monitor: worker died in startup phase [{brick=/opt/tier1data2019/brick}]
[2024-01-30 23:33:21.11514] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
Many thanks,
Anant
From: Aravinda <aravinda@xxxxxxxxxxx>
Sent: 31 January 2024 11:14 AM
To: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds

Hi Anant,
You need to run the gsec_create command whenever a new node is added to the primary or the secondary:
gluster system:: execute gsec_create
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
Or use the geo-rep setup tool to fix the key-related issues and set up the session again (https://github.com/aravindavk/gluster-georep-tools):
gluster-georep-setup tier1data drtier1data::drtier1data
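(If the tool is not already installed, I believe it is published as a Python package and can be installed with pip:)

pip3 install gluster-georep-tools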
--
Aravinda
Kadalu Technologies
---- On Wed, 31 Jan 2024 02:49:08 +0530 Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx> wrote ---
Hi All,
As per the documentation, if we use `delete` only, replication should resume from the point where it was left before the session was deleted, so I tried that, without any luck.

gluster volume geo-replication tier1data drtier1data::drtier1data delete
gluster volume geo-replication tier1data drtier1data::drtier1data create push-pem force
gluster volume geo-replication tier1data drtier1data::drtier1data start
gluster volume geo-replication tier1data drtier1data::drtier1data status
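(For completeness, my understanding is that the variant which discards the stored sync time, and therefore makes replication start again from the beginning, is:

gluster volume geo-replication tier1data drtier1data::drtier1data delete reset-sync-time

I have not used that here, since I want to resume from where the session left off.)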
I have also checked the drtier1data logs, and all I can see is that master1 connects to drtier1data and then sends a disconnect after about 5 seconds. Please see the following logs from drtier1data.
[2024-01-30 21:04:03.016805 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-drtier1data-client-0: Connected, attached to remote volume [{conn-name=drtier1data-client-0}, {remote_subvol=/opt/tier1data2019/brick}]
[2024-01-30 21:04:03.020148 +0000] I [fuse-bridge.c:5296:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.33
[2024-01-30 21:04:03.020197 +0000] I [fuse-bridge.c:5924:fuse_graph_sync] 0-fuse: switched to graph 0
[2024-01-30 21:04:08.573873 +0000] I [fuse-bridge.c:6233:fuse_thread_proc] 0-fuse: initiating unmount of /tmp/gsyncd-aux-mount-c8c41k2k
[2024-01-30 21:04:08.575131 +0000] W [glusterfsd.c:1429:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x817a) [0x7fb907e2e17a] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55f97b17dbfd] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x55f97b17da48] ) 0-: received signum (15), shutting down
[2024-01-30 21:04:08.575227 +0000] I [fuse-bridge.c:7063:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-c8c41k2k'.
[2024-01-30 21:04:08.575256 +0000] I [fuse-bridge.c:7068:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-c8c41k2k'.
Can anyone suggest how I can find the reason for these disconnect requests from master1, or what I should check next?
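(For context, this is roughly how I have been searching the geo-replication worker logs on master1; the log directory name is my best guess from the session name and may differ slightly:)

# on master1: look for errors around the time of the disconnects
grep -iE 'error|ENOTCONN|faulty' /var/log/glusterfs/geo-replication/tier1data_drtier1data_drtier1data/*.log | tail -n 50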
Many thanks,
A
From: Gluster-users <gluster-users-bounces@xxxxxxxxxxx> on behalf of Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
Sent: 30 January 2024 2:14 PM
To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds
Hello Everyone,
I am looking for some help. Can anyone please suggest if it's possible to promote a master node to be the primary in the geo-replication session?
We have three master nodes and one secondary node. We are facing issues where geo-replication is consistently failing from the primary master node. We want to check if it works fine from another master node.
Any guidance or assistance would be highly appreciated.
Many thanks,
Anant
From: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
Sent: 29 January 2024 3:55 PM
To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds
Hi @Strahil Nikolov,
We have been running this geo-replication for more than 5 years and it was working fine until last week, so I don't think it is something that was missed in the initial setup, but I am unable to understand why it's not working now.
I have enabled SSH debug logging on the secondary node (drtier1data), and I can see the following in the logs.
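(For reference, enabling the debug output was roughly this change on drtier1data; the exact steps may vary by distribution:)

# /etc/ssh/sshd_config on the secondary
LogLevel DEBUG1
# restart sshd so the new log level takes effect
systemctl restart sshd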
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: server_input_channel_req: channel 0 request exec reply 1
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_by_channel: session 0 channel 0
Jan 29 14:25:52 drtier1data sshd[1268110]: debug1: session_input_channel_req: session 0 req exec
Jan 29 14:25:52 drtier1data sshd[1268110]: Starting session: command for root from XX.236.28.58 port 53082 id 0
Jan 29 14:25:52 drtier1data sshd[1268095]: debug1: session_new: session 0
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: Received SIGCHLD.
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_by_pid: pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: session 0 channel 0 pid 1268111
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: session_exit_message: release channel 0
Jan 29 14:25:58 drtier1data sshd[1268110]: Received disconnect from XX.236.28.58 port 53082:11: disconnected by user
Jan 29 14:25:58 drtier1data sshd[1268110]: Disconnected from user root XX.236.28.58 port 53082
Jan 29 14:25:58 drtier1data sshd[1268110]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: do_cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: cleanup
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: closing session
Jan 29 14:25:58 drtier1data sshd[1268095]: pam_unix(sshd:session): session closed for user root
Jan 29 14:25:58 drtier1data sshd[1268095]: debug1: PAM: deleting credentials
As per the above logs, sshd on drtier1data receives a SIGCHLD and then a disconnect from master1 ("Received disconnect from XX.236.28.58 port 53082:11: disconnected by user").
Also, I have checked the gsyncd.log on master1, which says "SSH: SSH connection between master and slave established. [{duration=1.7277}]", so passwordless SSH is working fine.
As per my understanding, master1 can connect to the drtier1data server, the geo-replication status changes to Active --> History Crawl, and then something happens on master1 that triggers the SSH disconnect.
Is it possible to change the master node in geo-replication so that we can mark master2 as the primary instead of master1?
I am really struggling to fix this issue. Please help; any pointers are appreciated!
Many thanks,
Anant
From: Gluster-users <gluster-users-bounces@xxxxxxxxxxx> on behalf of Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>
Sent: 29 January 2024 12:20 AM
To: gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>; Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds

Hi Strahil,
As mentioned in my last email, I have copied the gluster public key from master3 to the secondary server, and I can now SSH from all master nodes to the secondary server, but I am still getting the same error.
[root@master1 geo-replication]# ssh root@drtier1data -i /var/lib/glusterd/geo-replication/secret.pem
Last login: Mon Jan 29 00:14:32 2024 from
[root@drtier1data ~]#
[root@master2 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data
Last login: Mon Jan 29 00:02:34 2024 from
[root@drtier1data ~]#
[root@master3 ~]# ssh -i /var/lib/glusterd/geo-replication/secret.pem root@drtier1data
Last login: Mon Jan 29 00:14:41 2024 from
[root@drtier1data ~]#
Thanks,
Anant
From: Strahil Nikolov <hunter86_bg@xxxxxxxxx>
Sent: 28 January 2024 10:07 PM
To: Anant Saraswat <anant.saraswat@xxxxxxxxxxxxxx>; gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>
Subject: Re: Geo-replication status is getting Faulty after few seconds

Gluster doesn't use the SSH key in /root/.ssh, so you need to exchange the public key that corresponds to /var/lib/glusterd/geo-replication/secret.pem. If you don't know the public key, you can derive it from the private key.
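For example, something like this prints the public key that matches the geo-replication private key:

# run on a primary node; outputs the public key for secret.pem
ssh-keygen -y -f /var/lib/glusterd/geo-replication/secret.pem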
Ensure that all hosts can ssh to the secondary
before proceeding with the troubleshooting.
Best Regards,
Strahil Nikolov
________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users