Hi Nathan
- We built a Ceph cluster with 3 nodes (created by Rook):
  node-3: osd-2, mon-b
  node-4: osd-0, mon-a, mds-myfs-a, mgr
  node-5: osd-1, mon-c, mds-myfs-b
- Observed behavior
After one node goes down unexpectedly (e.g. a hard power-off), mounting a CephFS volume takes more than 40 seconds.
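For illustration, the delay can be measured by timing a kernel-client mount like the one below; the monitor addresses, mount point, and secret file are placeholders, not values from the cluster above:

$ # Hypothetical example: time a CephFS kernel mount (placeholder IPs and paths).
$ time mount -t ceph 10.0.0.3:6789,10.0.0.4:6789,10.0.0.5:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret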
- Ceph cluster status under normal conditions:
$ ceph status
  cluster:
    id:     776b5432-be9c-455f-bb2e-05cbf20d6f6a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 20h)
    mgr: a(active, since 21h)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby
    osd: 3 osds: 3 up (since 20h), 3 in (since 21h)

  data:
    pools:   2 pools, 136 pgs
    objects: 2.59k objects, 330 MiB
    usage:   25 GiB used, 125 GiB / 150 GiB avail
    pgs:     136 active+clean

  io:
    client: 1.5 KiB/s wr, 0 op/s rd, 0 op/s wr
- CephFS status under normal conditions:
$ ceph fs status
myfs - 3 clients
====
+------+--------+--------+---------------+-------+-------+
| Rank | State  |  MDS   |    Activity   |  dns  |  inos |
+------+--------+--------+---------------+-------+-------+
|  0   | active | myfs-a | Reqs:    0 /s | 2250  | 2059  |
+------+--------+--------+---------------+-------+-------+
+---------------+----------+-------+-------+
|      Pool     |   type   |  used | avail |
+---------------+----------+-------+-------+
| myfs-metadata | metadata |  208M | 39.1G |
|   myfs-data0  |   data   |  121M | 39.1G |
+---------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|    myfs-b   |
+-------------+
MDS version: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
- Are you using replica or EC?
=> No, we are not using EC; the pools are replicated.
- 'min_size' is not smaller than 'size'?
$ ceph osd dump | grep pool
pool 1 'myfs-metadata' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 16 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 2 'myfs-data0' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 141 lfor 0/0/53 flags hashpspool stripe_width 0 application cephfs
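Since both pools are size 3 / min_size 2 and the replicas are spread across the three hosts, losing one host should leave 2 replicas per PG, which still satisfies min_size, so PG I/O itself should continue. A minimal sketch of how these values can be checked (and, if one accepted the risk, changed); lowering min_size to 1 is generally discouraged:

$ # Check replication parameters per pool.
$ ceph osd pool get myfs-metadata size
$ ceph osd pool get myfs-metadata min_size
$ # Lowering min_size would let PGs accept I/O with a single surviving replica,
$ # at a higher risk of data loss/inconsistency (not recommended):
$ # ceph osd pool set myfs-data0 min_size 1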
- What is your crush map?
$ ceph osd crush dump
{
    "devices": [
        {"id": 0, "name": "osd.0", "class": "hdd"},
        {"id": 1, "name": "osd.1", "class": "hdd"},
        {"id": 2, "name": "osd.2", "class": "hdd"}
    ],
    "types": [
        {"type_id": 0, "name": "osd"},
        {"type_id": 1, "name": "host"},
        {"type_id": 2, "name": "chassis"},
        {"type_id": 3, "name": "rack"},
        {"type_id": 4, "name": "row"},
        {"type_id": 5, "name": "pdu"},
        {"type_id": 6, "name": "pod"},
        {"type_id": 7, "name": "room"},
        {"type_id": 8, "name": "datacenter"},
        {"type_id": 9, "name": "zone"},
        {"type_id": 10, "name": "region"},
        {"type_id": 11, "name": "root"}
    ],
    "buckets": [
        {"id": -1, "name": "default", "type_id": 11, "type_name": "root", "weight": 9594, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": -3, "weight": 3198, "pos": 0}, {"id": -5, "weight": 3198, "pos": 1}, {"id": -7, "weight": 3198, "pos": 2}]},
        {"id": -2, "name": "default~hdd", "type_id": 11, "type_name": "root", "weight": 9594, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": -4, "weight": 3198, "pos": 0}, {"id": -6, "weight": 3198, "pos": 1}, {"id": -8, "weight": 3198, "pos": 2}]},
        {"id": -3, "name": "node-4", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 0, "weight": 3198, "pos": 0}]},
        {"id": -4, "name": "node-4~hdd", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 0, "weight": 3198, "pos": 0}]},
        {"id": -5, "name": "node-5", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 1, "weight": 3198, "pos": 0}]},
        {"id": -6, "name": "node-5~hdd", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 1, "weight": 3198, "pos": 0}]},
        {"id": -7, "name": "node-3", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 2, "weight": 3198, "pos": 0}]},
        {"id": -8, "name": "node-3~hdd", "type_id": 1, "type_name": "host", "weight": 3198, "alg": "straw2", "hash": "rjenkins1",
         "items": [{"id": 2, "weight": 3198, "pos": 0}]}
    ],
    "rules": [
        {"rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10,
         "steps": [{"op": "take", "item": -1, "item_name": "default"}, {"op": "chooseleaf_firstn", "num": 0, "type": "host"}, {"op": "emit"}]},
        {"rule_id": 1, "rule_name": "myfs-metadata", "ruleset": 1, "type": 1, "min_size": 1, "max_size": 10,
         "steps": [{"op": "take", "item": -1, "item_name": "default"}, {"op": "chooseleaf_firstn", "num": 0, "type": "host"}, {"op": "emit"}]},
        {"rule_id": 2, "rule_name": "myfs-data0", "ruleset": 2, "type": 1, "min_size": 1, "max_size": 10,
         "steps": [{"op": "take", "item": -1, "item_name": "default"}, {"op": "chooseleaf_firstn", "num": 0, "type": "host"}, {"op": "emit"}]}
    ],
    "tunables": {
        "choose_local_tries": 0,
        "choose_local_fallback_tries": 0,
        "choose_total_tries": 50,
        "chooseleaf_descend_once": 1,
        "chooseleaf_vary_r": 1,
        "chooseleaf_stable": 1,
        "straw_calc_version": 1,
        "allowed_bucket_algs": 54,
        "profile": "jewel",
        "optimal_tunables": 1,
        "legacy_tunables": 0,
        "minimum_required_version": "jewel",
        "require_feature_tunables": 1,
        "require_feature_tunables2": 1,
        "has_v2_rules": 0,
        "require_feature_tunables3": 1,
        "has_v3_rules": 0,
        "has_v4_buckets": 1,
        "require_feature_tunables5": 1,
        "has_v5_rules": 0
    },
    "choose_args": {}
}
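All three rules take the default root and do chooseleaf_firstn across hosts, so each replica lands on a different node. If it is easier to read, the CRUSH map can also be inspected in decompiled text form; the file names below are arbitrary placeholders:

$ # Sketch: dump and decompile the CRUSH map for inspection.
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
$ less crushmap.txt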
- Question
How can we mount a CephFS volume as quickly as possible after a node goes down unexpectedly? Do you have any suggestions for the Ceph cluster (filesystem) configuration? Would using EC help?
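Part of the delay is presumably failure detection and MDS takeover: an OSD is only marked down after its heartbeat grace expires, and a standby MDS only takes over after the failed MDS misses its beacon grace. A sketch of the knobs involved follows; the values are purely illustrative, and lowering them too far can cause spurious failovers under load:

$ # Illustrative values only, not a recommendation.
$ ceph config set global osd_heartbeat_grace 10   # default 20 s before an OSD is reported down
$ ceph config set global mds_beacon_grace 10      # default 15 s before the mon fails the MDS
$ # A standby-replay daemon can shorten MDS takeover for this filesystem:
$ ceph fs set myfs allow_standby_replay true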
Best Regards
hfx@xxxxxxxxxx
Hi Nathan

Is that true? The time it takes to reallocate the primary PG delivers “downtime” by design, right? Seen from a writing client's perspective.

Jesper
Friday, 29 November 2019, 06.24 +0100 from pengbo@xxxxxxxxxxx <pengbo@xxxxxxxxxxx>:

Hi Nathan,

Thanks for the help.
My colleague will provide more details.

BR

On Fri, Nov 29, 2019 at 12:57 PM Nathan Fish <lordcirth@xxxxxxxxx> wrote:

If correctly configured, your cluster should have zero downtime from a
single OSD or node failure. What is your crush map? Are you using
replica or EC? If your 'min_size' is not smaller than 'size', then you
will lose availability.
On Thu, Nov 28, 2019 at 10:50 PM Peng Bo <pengbo@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> We are working on using Ceph to build our HA system; the goal is that the system should keep providing service even when a Ceph node is down or an OSD is lost.
>
> Currently, in our tests, once a node/OSD goes down the Ceph cluster takes about 40 seconds to resynchronize data, and our system cannot provide service during that time.
>
> My questions:
>
> Is there any way we can reduce the data sync time?
> How can we keep Ceph available once a node/OSD is down?
>
>
> BR
>
> --
> The modern Unified Communications provider
>
> https://www.portsip.com
--
The modern Unified Communications provider
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com