Re: Default erasure code profile not working for 3 node cluster?

Hi Mark,


A K=2 + M=2 EC profile with the failure domain set to host requires at least 4 nodes. The documentation sentence "The simplest erasure coded pool is equivalent to RAID5<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5> and requires at least three hosts" assumes an EC profile of K=2 + M=1, which is technically the minimum configuration but generally NOT recommended for data durability.
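
For a 3-node cluster you essentially have two options: drop to K=2+M=1 (accepting the weaker durability), or keep K=2+M=2 but set the failure domain to osd so that two chunks of a PG can land on different OSDs of the same host. A rough sketch of the second option (the profile and pool names here are just examples, and since an existing pool's EC profile cannot be changed you would need to create a new pool and migrate the data):

ceph osd erasure-code-profile set dataHDD-osd k=2 m=2 crush-failure-domain=osd crush-device-class=hdd
ceph osd pool create cephfsHDD_data_ec22 1024 1024 erasure dataHDD-osd
ceph osd pool set cephfsHDD_data_ec22 allow_ec_overwrites true

Keep in mind that with crush-failure-domain=osd a single host failure can take out more than M chunks of a PG, so the pool may become unavailable while that host is down.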

Regards, Levin






From: Mark S. Holliman <msh@xxxxxxxxx>
Date: Monday, 25 July 2022 at 21:15
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject:  Default erasure code profile not working for 3 node cluster?
Dear All,

I've recently set up a 3 node Ceph Quincy (17.2) cluster to serve a pair of CephFS mounts for a Slurm cluster. Each ceph node has 6 x SSD and 6 x HDD, and I've set up the pools and crush rules to create separate CephFS filesystems using the different disk classes. I used the default erasure-code-profile to create the pools (see details below), as the documentation states that it works on a 3 node cluster. The system looked healthy after the initial setup, but now, a few weeks in, I'm seeing signs of problems: a growing count of pgs not deep-scrubbed in time, significant numbers of pgs in "active+undersized"/"active+undersized+degraded", most pgs in an "active+clean+remapped" state, and no recovery activity.

I looked at some of the pgs in the stuck states, and noticed that they all list a "NONE" OSD in their 'last acting' list, which points to this issue: https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean .

It's likely that this is what is causing the pgs to get stuck in a degraded state and the ever-growing list of late deep scrubs. But I'm confused as to why the documentation states that the default erasure code profile should work on a 3 node cluster - https://docs.ceph.com/en/latest/rados/operations/erasure-code/#creating-a-sample-erasure-coded-pool  Is this documentation in error? Or is there something else going on with my setup? What is an ideal erasure code profile for a 3 node system?
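
For reference, the profile and pg details pasted below were pulled with commands roughly along these lines (3.282 is one of the stuck pgs):

ceph osd erasure-code-profile get default
ceph health detail
ceph pg 3.282 query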

Cheers,
  Mark

### Commands used to create the CephFS filesystem ###
ceph osd pool create cephfsHDD_data 1024 1024 erasure
ceph osd pool create cephfsHDD_metadata 64 64
ceph osd erasure-code-profile set dataHDD crush-device-class=hdd
ceph osd crush rule create-erasure dataHDD dataHDD
ceph osd pool set cephfsHDD_data crush_rule dataHDD
ceph osd pool set cephfsHDD_data allow_ec_overwrites true
ceph fs new cephfsHDD cephfsHDD_metadata cephfsHDD_data --force

### Example Status
    health: HEALTH_WARN
            Degraded data redundancy: 750/10450 objects degraded (7.177%), 313 pgs degraded, 775 pgs undersized
            887 pgs not deep-scrubbed in time
            887 pgs not scrubbed in time
  services:
    mon: 3 daemons...
    mgr: ...
    mds: 2/2 daemons up, 2 standby
    osd: 36 osds: 36 up (since 27h), 36 in (since 5w); 1272 remapped pgs
  data:
    volumes: 2/2 healthy
    pools:   5 pools, 2176 pgs
    objects: 2.82k objects, 361 MiB
    usage:   40 GiB used, 262 TiB / 262 TiB avail
    pgs:     750/10450 objects degraded (7.177%)
             1240/10450 objects misplaced (11.866%)
             1272 active+clean+remapped
             462  active+undersized
             313  active+undersized+degraded
             129  active+clean

### Erasure Code Profile
k=2
m=2
plugin=jerasure
technique=reed_sol_van

### Pool details
root@dokkalfar01:~# ceph osd pool get cephfsHDD_data all
size: 4
min_size: 3
pg_num: 1023
pgp_num: 972
crush_rule: dataHDD
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: default
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false


### Example health details of unhappy pgs
    pg 3.282 is stuck undersized for 27h, current state active+undersized+degraded, last acting [29,15,5,NONE]
    pg 3.285 is stuck undersized for 27h, current state active+undersized+degraded, last acting [0,17,28,NONE]
    pg 3.286 is stuck undersized for 27h, current state active+undersized+degraded, last acting [3,17,26,NONE]
    pg 3.288 is stuck undersized for 27h, current state active+undersized+degraded, last acting [13,NONE,0,24]
    pg 3.28e is stuck undersized for 27h, current state active+undersized+degraded, last acting [28,NONE,5,14]
    pg 3.297 is stuck undersized for 27h, current state active+undersized+degraded, last acting [25,5,13,NONE]



-------------------------------
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------------------------------
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



