Hi Reed,
Thank you so much for the input and support. We tried the setting you suggested, but could not see any impact on the current system:
"ceph fs set cephfs allow_standby_replay true" did not change the failover time.
Furthermore, we tried a few more scenarios in our testing:
Scenario 1:
- Here we looked at the logs on the node that the MDS fails over to, i.e. if we reboot cephnode2, the new active MDS will be cephnode1. We checked the logs on cephnode1 in two cases:
- 1. Normal reboot of cephnode2 while the I/O operation is in progress:
- We see that logging on cephnode1 starts immediately, then waits for some time (around 15 seconds, apparently beacon related) plus an additional 6-7 seconds during which the MDS on cephnode1 becomes active and I/O resumes. Refer to the logs:
- 2021-04-29T15:49:42.480+0530 7fa747690700 1 mds.cephnode1 Updating MDS map to version 505 from mon.2
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 handle_mds_map i am now mds.0.505
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 handle_mds_map state change up:boot --> up:replay
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 replay_start
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 recovery set is
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 waiting for osdmap 486 (which blacklists prior instance)
2021-04-29T15:49:55.686+0530 7fa74568c700 1 mds.beacon.cephnode1 MDS connection to Monitors appears to be laggy; 15.9769s since last acked beacon
2021-04-29T15:49:55.686+0530 7fa74568c700 1 mds.0.505 skipping upkeep work because connection to Monitors appears laggy
2021-04-29T15:49:57.533+0530 7fa749e95700 0 mds.beacon.cephnode1 MDS is no longer laggy
2021-04-29T15:49:59.599+0530 7fa740e83700 0 mds.0.cache creating system inode with ino:0x100
2021-04-29T15:49:59.599+0530 7fa740e83700 0 mds.0.cache creating system inode with ino:0x1
2021-04-29T15:50:00.456+0530 7fa73f680700 1 mds.0.505 Finished replaying journal
2021-04-29T15:50:00.456+0530 7fa73f680700 1 mds.0.505 making mds journal writeable
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.cephnode1 Updating MDS map to version 506 from mon.2
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 handle_mds_map i am now mds.0.505
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 handle_mds_map state change up:replay --> up:reconnect
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 reconnect_start
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 reopen_log
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.server reconnect_clients -- 2 sessions
2021-04-29T15:50:00.964+0530 7fa747690700 0 log_channel(cluster) log [DBG] : reconnect by client.6892 v1:10.0.4.96:0/1646469259 after 0.00499997
2021-04-29T15:50:00.972+0530 7fa747690700 0 log_channel(cluster) log [DBG] : reconnect by client.6990 v1:10.0.4.115:0/2776266880 after 0.0129999
2021-04-29T15:50:00.972+0530 7fa747690700 1 mds.0.505 reconnect_done
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.cephnode1 Updating MDS map to version 507 from mon.2
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 handle_mds_map i am now mds.0.505
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 handle_mds_map state change up:reconnect --> up:rejoin
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 rejoin_start
2021-04-29T15:50:02.008+0530 7fa747690700 1 mds.0.505 rejoin_joint_start
2021-04-29T15:50:02.040+0530 7fa740e83700 1 mds.0.505 rejoin_done
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.cephnode1 Updating MDS map to version 508 from mon.2
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 handle_mds_map i am now mds.0.505
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 handle_mds_map state change up:rejoin --> up:clientreplay
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 recovery_done -- successful recovery!
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 clientreplay_start
2021-04-29T15:50:03.094+0530 7fa740e83700 1 mds.0.505 clientreplay_done
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.cephnode1 Updating MDS map to version 509 from mon.2
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 handle_mds_map i am now mds.0.505
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 handle_mds_map state change up:clientreplay --> up:active
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 active_start
2021-04-29T15:50:04.085+0530 7fa747690700 1 mds.0.505 cluster recovered.
- 2. Hard reset/power-off of cephnode2 while the I/O operation is in progress:
- In this case we see that the logs on cephnode1 (where the new MDS will be activated) only start appearing 15+ seconds after the power-off.
- Time at which the power-off was done: 2021-04-29-16-17-37
- Time at which the logs started to appear on cephnode1 (see below), i.e. nearly 15 seconds after the hardware reset:
- 2021-04-29T16:17:51.983+0530 7f5ba3a38700 1 mds.cephnode1 Updating MDS map to version 518 from mon.0
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map i am now mds.0.518
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map state change up:boot --> up:replay
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 replay_start
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 recovery set is
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 waiting for osdmap 504 (which blacklists prior instance)
2021-04-29T16:17:54.044+0530 7f5b9ca2a700 0 mds.0.cache creating system inode with ino:0x100
2021-04-29T16:17:54.045+0530 7f5b9ca2a700 0 mds.0.cache creating system inode with ino:0x1
2021-04-29T16:17:55.025+0530 7f5b9ba28700 1 mds.0.518 Finished replaying journal
2021-04-29T16:17:55.025+0530 7f5b9ba28700 1 mds.0.518 making mds journal writeable
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.cephnode1 Updating MDS map to version 519 from mon.0
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map i am now mds.0.518
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map state change up:replay --> up:reconnect
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518 reconnect_start
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518 reopen_log
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.server reconnect_clients -- 2 sessions
2021-04-29T16:17:56.068+0530 7f5ba3a38700 0 log_channel(cluster) log [DBG] : reconnect by client.6990 v1:10.0.4.115:0/2776266880 after 0.00799994
2021-04-29T16:17:56.069+0530 7f5ba3a38700 0 log_channel(cluster) log [DBG] : reconnect by client.6892 v1:10.0.4.96:0/1646469259 after 0.00899994
2021-04-29T16:17:56.069+0530 7f5ba3a38700 1 mds.0.518 reconnect_done
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.cephnode1 Updating MDS map to version 520 from mon.0
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map i am now mds.0.518
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map state change up:reconnect --> up:rejoin
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518 rejoin_start
2021-04-29T16:17:57.103+0530 7f5ba3a38700 1 mds.0.518 rejoin_joint_start
2021-04-29T16:17:57.472+0530 7f5b9d22b700 1 mds.0.518 rejoin_done
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.cephnode1 Updating MDS map to version 521 from mon.0
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map i am now mds.0.518
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map state change up:rejoin --> up:clientreplay
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518 recovery_done -- successful recovery!
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518 clientreplay_start
2021-04-29T16:17:58.157+0530 7f5b9d22b700 1 mds.0.518 clientreplay_done
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.cephnode1 Updating MDS map to version 522 from mon.0
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map i am now mds.0.518
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518 handle_mds_map state change up:clientreplay --> up:active
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518 active_start
2021-04-29T16:17:59.181+0530 7f5ba3a38700 1 mds.0.518 cluster recovered.
In both test cases above we saw an extra delay of around 15 seconds plus 8-10 seconds (a total of about 21-25 seconds for failover in the reboot/power-off case).
Query: Is there any specific config we should tweak to reduce the time it takes for the standby MDS to learn that it has to take over and become active?
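If it is the beacon settings that drive the ~15 second detection window, a minimal sketch of what could be tried is below (the values are only examples we have not validated; too low a grace period risks the monitors marking a healthy MDS as laggy/failed):

    # Current values (mds_beacon_grace defaults to 15s, mds_beacon_interval to 4s)
    ceph config get mds mds_beacon_grace
    ceph config get mds mds_beacon_interval

    # Example of lowering them -- illustrative values only
    ceph config set global mds_beacon_interval 2
    ceph config set global mds_beacon_grace 8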
Scenario 2:
- Stop only the MDS daemon service on the active node
- In this scenario, where we only stopped the MDS systemd service on the active node, we got a very good reading of around 5-7 seconds for failover.
Deployment Mode:                2 Node MDS Setup with max_mds=1
Ceph MDS Setup:                 Active-Standby MDS
Test Case:                      Active Node MDS daemon stop
I/O Resume Duration (Seconds):  5-7
Node affected:                  cephnode1
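For context, the I/O probe behind the resume-duration numbers is essentially just continuous directory creation, as in the sketch below (the mount point /mnt/cephfs and the naming are illustrative, not our exact script):

    # Create one directory per second on the CephFS mount. When the mount
    # stalls during failover, mkdir blocks, so the gap between the timestamp
    # in the last directory name before the stall and the first one after it
    # gives the I/O resume duration.
    i=0
    while true; do
        mkdir "/mnt/cephfs/failover_probe_$(date +%Y%m%dT%H%M%S)_${i}"
        i=$((i+1))
        sleep 1
    done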
Please suggest/advise any configuration we could try to achieve a minimal failover duration in the first two scenarios (reboot and power-off).
Best Regards,
Lokendra
On Thu, Apr 29, 2021 at 1:47 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
I don't have anything of merit to add to this, but it would be an interesting addition to your testing to see if active+standby-replay makes any difference with test-case 1.
I don't think it would be applicable to any of the other use-cases, as a standby-replay MDS is bound to a single rank, meaning it's bound to a single active MDS, and can't function as a standby for active:active.
Good luck and look forward to hearing feedback/more results.
Reed

On Apr 27, 2021, at 8:40 AM, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hi Team,
We have set up a two node Ceph cluster using the native CephFS driver with details as:
- 3 Node / 2 Node MDS Cluster
- 3 Node Monitor Quorum
- 2 Node OSD
- 2 Nodes for Manager
Cephnode3 has only mon and MDS (only for test cases 4-7); the other two nodes, i.e. cephnode1 and cephnode2, have (mgr, mds, mon, rgw).
We have tested the following failover scenarios for the native CephFS driver by mounting any one sub-volume on a VM or client with continuous I/O operations (directory creation every 1 second):
<image.png>
Regarding the table above we have a few queries:
- Refer to test case 2 and test case 7: both are similar test cases, the only difference being the number of Ceph MDS daemons, yet the measured times differ. The time should be zero, but it comes out as 17 seconds for test case 7.
- Is there any configurable parameter/configuration we need to set in the Ceph cluster to reduce the failover time to a few seconds?
In the current default deployment we are getting something around 35-40 seconds.
Best Regards,
--
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx