How To Properly Fail Over an HA Setup

Hello Everyone,

  I've got a 3-node Jewel cluster, and I think I'm missing something.  When I take one of the nodes down for maintenance (kernel upgrades or the like), all of my clients (which use the kernel CephFS client) hang for a couple of minutes before the redundant servers kick in.  Are there commands I should run before taking a box down to speed this up, or some way to make the clients/cluster detect the failure quicker?
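
  For the planned-maintenance case, I imagine something like the sketch below is what I'm missing, but I haven't confirmed it (this assumes file1 currently holds the active MDS rank 0 and the standby on file2 is healthy):

    # keep the cluster from marking the box's OSD out while it is down
    ceph osd set noout

    # hand MDS rank 0 to the standby now, instead of waiting for the
    # beacon timeout after the reboot
    ceph mds fail 0

    # ... kernel upgrade / reboot of file1 ...

    # once its daemons have rejoined the cluster
    ceph osd unset noout

  Is that roughly the right idea?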

  My setup is as follows.  I have three servers: mon1, file1 and file2.  All three boxes run monitor daemons, and file1 and file2 also run MDS and OSD daemons.  My ceph -s looks like this:

   cluster 8ab75485-8141-44ad-8eee-f92f286515ac
     health HEALTH_OK
     monmap e2: 3 mons at {FILE1=10.1.1.201:6789/0,FILE2=10.1.1.202:6789/0,MON1=10.1.1.90:6789/0}
            election epoch 1384, quorum 0,1,2 MON1,FILE1,FILE2
      fsmap e63874: 1/1/1 up {0=FILE1=up:active}, 1 up:standby
     osdmap e639: 2 osds: 2 up, 2 in
            flags sortbitwise,require_jewel_osds
      pgmap v15312803: 128 pgs, 3 pools, 236 GB data, 729 kobjects
            493 GB used, 406 GB / 899 GB avail
                 128 active+clean
  client io 3243 kB/s rd, 776 kB/s wr, 8 op/s rd, 1 op/s wr

and my client fstabs look like this:

10.1.1.90:6789,10.1.1.201:6789,10.1.1.202:6789:/shared /mnt/shared ceph noatime,_netdev,name=webdata,secretfile=/etc/ceph/websecret 0 0
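
For the unplanned-failure side of the question, the only knob I've come across so far is mds_beacon_grace (how long the monitors wait for beacons before declaring the active MDS failed; 15 seconds by default, if I'm reading the docs right).  I'm not at all sure that's where my couple of minutes actually goes, but this is the sort of thing I've been considering:

    # runtime injection, one per monitor (names from my monmap)
    ceph tell mon.MON1  injectargs '--mds_beacon_grace 10'
    ceph tell mon.FILE1 injectargs '--mds_beacon_grace 10'
    ceph tell mon.FILE2 injectargs '--mds_beacon_grace 10'

    # plus the same value in /etc/ceph/ceph.conf so it survives restarts
    [global]
    mds beacon grace = 10

Is that sane, or am I looking in the wrong place?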

Any help would be appreciated.





