Re: [ Ceph MDS MON Config Variables ] Failover Delay issue


 



I wouldn't recommend a colocated MDS in a production environment.


Quoting Lokendra Rathour <lokendrarathour@xxxxxxxxx>:

Hello Frank,
Thanks for your inputs.

*Responding to your queries; kindly refer below:*

   - *Do you have services co-located?*
   - [loke]: Yes, they are co-located:
         - Cephnode1: MDS, MGR, MON, RGW, OSD
         - Cephnode2: MDS, MGR, MON, RGW, OSD
         - Cephnode3: MON
   - *Which of the times (1) or (2) are you referring to?*
   - For part one (1): we count the time from when the I/O stops until the
     I/O resumes, which includes:
      - the call for a new MON election
      - the election of the MON leader
      - the new MON leader making the standby MDS active
      - the resuming of the stuck I/O threads
      - other internal processing (I am only listing what I could read
        from the logs)

   - *How many FS clients do you have?*
   - We are testing with only one client at the moment, mounted using the
     native CephFS driver; we pass the IP addresses of both MDS nodes (in our
     case both Ceph nodes) using the following method:
         - sudo mount -t ceph 10.0.4.10,10.0.4.11:6789:/volumes/path/
           /mnt/cephconf -o
           name=foo,secret=AQAus49gdCHvIxAAB89BcDYqYSqJ8yOJBg5grw==
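In case it is useful, the same mount could also list all three monitor
addresses and read the key from a file rather than putting the secret on the
command line. This is only a sketch: 10.0.4.12 stands in for Cephnode3's
address (not given above), and /etc/ceph/foo.secret is a hypothetical file
containing only the client.foo key:

   # hypothetical key file; contains nothing but the base64 key for client.foo
   sudo mount -t ceph 10.0.4.10,10.0.4.11,10.0.4.12:6789:/volumes/path/ \
        /mnt/cephconf -o name=foo,secretfile=/etc/ceph/foo.secret

Listing every monitor lets the kernel client fall back to a surviving MON when
one of the nodes is rebooted.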


*One input*: if we only shut down the active MDS daemon (i.e. we do not
reboot the physical node, only the MDS service), failover takes only 4-7
seconds. When we reboot a physical node, Cephnode1 or Cephnode2 (so MON, MGR,
RGW and OSD are rebooted along with the MDS), we see around 40 seconds.
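To see where the 40 seconds go, one rough approach (a sketch, assuming the
ceph CLI is run from a node that stays up during the reboot) is to watch the
cluster log while the node goes down and compare the timestamps of the
election messages with the MDS events:

   # follow cluster log messages live during the reboot
   ceph -w
   # afterwards, confirm which MONs are still in quorum
   ceph quorum_status
   # and which daemon now holds the active MDS rank
   ceph fs status

The gap between the "calling monitor election" messages and the first MDS
state change should show how much of the delay is MON re-election versus the
MDS takeover itself.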

Best Regards,
Lokendra


On Mon, May 3, 2021 at 10:30 PM Frank Schilder <frans@xxxxxx> wrote:

Following up on this and other comments, there are two different time
delays. One (1) is the time it takes from killing an MDS until a stand-by
is made an active rank, and (2) is the time it takes for the new active rank
to restore all client sessions. My experience is that (1) takes close to 0
seconds, while (2) can take 20-30 seconds depending on how busy the clients
are; the MDS will go through various states before reaching active.
We usually have ca. 1600 client connections to our FS. With fewer clients,
MDS fail-over is practically instantaneous. We are using the latest Mimic.
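In case it helps, a minimal sketch of how those state transitions could be
observed from an admin node (assuming the ceph CLI is available there);
during (2) the new rank passes through states such as up:replay,
up:reconnect and up:rejoin before reaching up:active:

   # print the MDS rank states once per second during a fail-over test
   while true; do date; ceph mds stat; sleep 1; done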

From what you write, you seem to have a 40-second window for (1), which
points to a problem different from MON config values. This is supported by
your description including a MON election (??? this should never happen).
Do you have services co-located? Which of the times (1) or (2) are you
referring to? How many FS clients do you have?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 03 May 2021 17:19:37
To: Lokendra Rathour
Cc: Ceph Development; dev; ceph-users
Subject:  Re: [ Ceph MDS MON Config Variables ] Failover Delay
issue

On Mon, May 3, 2021 at 6:36 AM Lokendra Rathour
<lokendrarathour@xxxxxxxxx> wrote:
>
> Hi Team,
> I was setting up the ceph cluster with
>
>    - Node details: 3 MON, 2 MDS, 2 MGR, 2 RGW
>    - Deployment type: Active/Standby
>    - Testing mode: failover of the MDS node
>    - Setup: Octopus (15.2.7)
>    - OS: CentOS 8.3
>    - Hardware: HP
>    - RAM: 128 GB on each node
>    - OSD: 2 (1 TB each)
>    - Operation: normal I/O with a mkdir every 1 second.
>
> *Test Case: Power off any active MDS node for failover to happen*
>
> *Observation:*
> We have observed that whenever an active MDS node is down, it takes around
> *40 seconds* to activate the standby MDS node.
> On further checking the logs of the newly active MDS node, we have seen
> the delay based on the following inputs:
>
>    1. A 10-second delay, after which the MON calls for a new monitor election
>       1.  [log]  0 log_channel(cluster) log [INF] : mon.cephnode1 calling
>       monitor election

In the process of killing the active MDS, are you also killing a monitor?
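
(In case it is not obvious from the deployment layout, a quick sketch for
checking this, assuming the CLI is run from a surviving node:

   # list the monitors and which of them are currently in quorum
   ceph mon stat
   # more detail, including the current quorum leader
   ceph quorum_status

If the rebooted node hosted the quorum leader, a new election is expected
before anything else can proceed.)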

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



--
~ Lokendra
www.inertiaspeaks.com
www.inertiagroups.com
skype: lokendrarathour


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



