Re: stretched cluster new pool and second pool with nvme

How exactly does your crush rule look right now? I assume it's supposed to distribute data across two sites, and since one site is missing, the PGs stay in a degraded state until that site comes back up. You would need to either change the crush rule or assign a different one to that pool, one that allows recovery on the remaining site alone.
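You can dump the rule with "ceph osd crush rule dump <rule>" and check which rule the pool uses with "ceph osd pool get <pool> crush_rule". For reference, a stretch-style rule as shown in the Ceph docs looks roughly like this (rule name and id are just placeholders):

######################################
rule stretch_rule {
    id 1
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
######################################

A rule like that always places replicas in both datacenters, so with one site down the PGs can never get back to 4 copies on the surviving site. Assigning a different rule would be along the lines of "ceph osd pool set <pool> crush_rule <other_rule>".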

Quoting "ronny.lippold" <ceph@xxxxxxxxx>:

hi stefan ... i did the next step and need your help.

my idea was to stretch the cluster without stretch mode. so we decided to reserve enough capacity for a size of 4 on each side.

the setup is the same as stretch mode, including crush rule, locations, election_strategy and tie breaker.
only "ceph mon enable_stretch_mode e stretch_rule datacenter" was not run.

now in my test i created a split brain and expected that the cluster would rebuild the 4 replicas on the remaining side.
but that did not happen.
actually, the cluster is doing the same thing as with stretch mode enabled: writeable with 2 replicas.

can you explain to me why? i'm going around in circles.

this is the status during split brain:

######################################
pve-test02-01:~# ceph -s
  cluster:
    id:     376fcdef-bba0-4e58-b63e-c9754dc948fa
    health: HEALTH_WARN
            6/13 mons down, quorum pve-test01-01,pve-test01-03,pve-test01-05,pve-test02-01,pve-test02-03,pve-test02-05,tie-breaker
            1 datacenter (8 osds) down
            8 osds down
            6 hosts (8 osds) down
            Degraded data redundancy: 2116/4232 objects degraded (50.000%), 95 pgs degraded, 113 pgs undersized

  services:
    mon: 13 daemons, quorum pve-test01-01,pve-test01-03,pve-test01-05,pve-test02-01,pve-test02-03,pve-test02-05,tie-breaker (age 54m), out of quorum: pve-test01-02, pve-test01-04, pve-test01-06, pve-test02-02, pve-test02-04, pve-test02-06
    mgr: pve-test02-05(active, since 53m), standbys: pve-test01-05, pve-test01-01, pve-test01-03, pve-test02-01, pve-test02-03
    mds: 1/1 daemons up, 1 standby
    osd: 16 osds: 8 up (since 54m), 16 in (since 77m)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 1.06k objects, 3.9 GiB
    usage:   9.7 GiB used, 580 GiB / 590 GiB avail
    pgs:     2116/4232 objects degraded (50.000%)
             95 active+undersized+degraded
             18 active+undersized

  io:
    client:   17 KiB/s wr, 0 op/s rd, 10 op/s wr
######################################

thanks a lot,
ronny

On 2024-04-30 11:42, Stefan Kooman wrote:
On 30-04-2024 11:22, ronny.lippold wrote:
hi stefan ... you are the hero of the month ;)

:p.


i don't know why i did not find your bug report.

i have the exact same problem and resolved the HEALTH warning only with "ceph osd force_healthy_stretch_mode --yes-i-really-mean-it".
i will comment on the report soon.

actually, we are thinking about a 4/2 size without stretch mode enabled.

what was your solution?

This specific setup (on which I did the testing) is going to be full flash (SSD), so the HDDs are going to be phased out and only the default non-device-class crush rule will be used. While that will work for this (small) cluster, it is not a solution. This issue should be fixed, as I figure there are quite a few clusters that want to use device classes and stretch mode at the same time.
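For reference, the kind of device-class rule that runs into this issue would look roughly like the following (name and id are placeholders), i.e. the same placement logic restricted to one device class:

######################################
rule stretch_ssd_rule {
    id 2
    type replicated
    step take default class ssd
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
######################################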

Gr. Stefan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


