Danny, Levin,

Thanks, both your answers helped (and are exactly what I suspected was the case). Looking back at the documentation I can see where my confusion began, as it isn't clear there that the "simplest" and "default" erasure code profiles are different. I'll report a documentation bug in the hope that they clarify things (I know of at least one other admin who hit the same issue I'm seeing, so I'm not the only one...).

Cheers,
Mark

From: Danny Webb <Danny.Webb@xxxxxxxxxxxxxxx>
Sent: 25 July 2022 14:32
To: Mark S. Holliman <msh@xxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: Default erasure code profile not working for 3 node cluster?

The only thing I can see from your setup is that you've not set a failure domain in your crush rule, so it defaults to host. A 2/2 erasure code won't work in that scenario, as each chunk of the EC stripe must be in its own failure domain. If you wanted it to work with that setup you'd need to change the crush failure domain to OSD rather than host (but then you'd not have the ability to lose a host). If you wanted to use a failure domain of host, you'd need to set your k/m values to 2/1, and even with that you'd still not be able to lose a host and have a writable cluster.

________________________________
From: Mark S. Holliman <msh@xxxxxxxxx>
Sent: 25 July 2022 14:13
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Default erasure code profile not working for 3 node cluster?

Dear All,

I've recently set up a 3 node Ceph Quincy (17.2) cluster to serve a pair of CephFS mounts for a Slurm cluster. Each ceph node has 6 x SSD and 6 x HDD, and I've set up the pools and crush rules to create separate CephFS filesystems using the different disk classes. I used the default erasure-code-profile to create the pools (see details below), as the documentation states that it works on a 3 node cluster.
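[Editor's aside: the arithmetic behind Danny's reply above can be checked directly on the cluster. This is a sketch using standard ceph CLI commands; `dataHDD` is the crush rule named later in this thread, and the default profile's k=2/m=2 values are quoted from the pool details below.]

```shell
# Confirm which erasure code profile the pool is using and what the
# crush rule's failure domain ("type" in the chooseleaf step) is.
ceph osd erasure-code-profile get default   # this cluster reports k=2, m=2
ceph osd crush rule dump dataHDD

# With k=2, m=2 a PG needs k+m = 4 chunks, each placed in a distinct
# failure domain. With the default failure domain of "host" and only
# 3 hosts, the fourth chunk has nowhere to go -- hence the "NONE"
# slot in the acting sets shown further down.
```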
The system looked healthy after the initial setup, but now, a few weeks in, I'm seeing signs of problems: a growing count of pgs not deep-scrubbed in time, significant numbers of pgs in "active+undersized"/"active+undersized+degraded", most pgs in an "active+clean+remapped" state, and no recovery activity. I looked at some of the pgs in the stuck states and noticed that they all list a "NONE" OSD in their 'last acting' list, which points to this issue: https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean

That is likely what is causing the pgs to get stuck in a degraded state and the ever-growing list of late deep scrubs. But I'm confused why the documentation states that the default erasure code profile should work on a 3 node cluster: https://docs.ceph.com/en/latest/rados/operations/erasure-code/#creating-a-sample-erasure-coded-pool

Is this documentation in error? Or is there something else going on with my setup? What is an ideal erasure code profile for a 3 node system?

Cheers,
Mark

### Commands used to create the CephFS filesystem ###
ceph osd pool create cephfsHDD_data 1024 1024 erasure
ceph osd pool create cephfsHDD_metadata 64 64
ceph osd erasure-code-profile set dataHDD crush-device-class=hdd
ceph osd crush rule create-erasure dataHDD dataHDD
ceph osd pool set cephfsHDD_data crush_rule dataHDD
ceph osd pool set cephfsHDD_data allow_ec_overwrites true
ceph fs new cephfsHDD cephfsHDD_metadata cephfsHDD_data --force

### Example Status
  health: HEALTH_WARN
          Degraded data redundancy: 750/10450 objects degraded (7.177%), 313 pgs degraded, 775 pgs undersized
          887 pgs not deep-scrubbed in time
          887 pgs not scrubbed in time
  services:
    mon: 3 daemons...
    mgr: ...
    mds: 2/2 daemons up, 2 standby
    osd: 36 osds: 36 up (since 27h), 36 in (since 5w); 1272 remapped pgs
  data:
    volumes: 2/2 healthy
    pools:   5 pools, 2176 pgs
    objects: 2.82k objects, 361 MiB
    usage:   40 GiB used, 262 TiB / 262 TiB avail
    pgs:     750/10450 objects degraded (7.177%)
             1240/10450 objects misplaced (11.866%)
             1272 active+clean+remapped
             462  active+undersized
             313  active+undersized+degraded
             129  active+clean

### Erasure Code Profile
k=2
m=2
plugin=jerasure
technique=reed_sol_van

### Pool details
root@dokkalfar01:~# ceph osd pool get cephfsHDD_data all
size: 4
min_size: 3
pg_num: 1023
pgp_num: 972
crush_rule: dataHDD
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: default
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false

### Example health details of unhappy pgs
pg 3.282 is stuck undersized for 27h, current state active+undersized+degraded, last acting [29,15,5,NONE]
pg 3.285 is stuck undersized for 27h, current state active+undersized+degraded, last acting [0,17,28,NONE]
pg 3.286 is stuck undersized for 27h, current state active+undersized+degraded, last acting [3,17,26,NONE]
pg 3.288 is stuck undersized for 27h, current state active+undersized+degraded, last acting [13,NONE,0,24]
pg 3.28e is stuck undersized for 27h, current state active+undersized+degraded, last acting [28,NONE,5,14]
pg 3.297 is stuck undersized for 27h, current state active+undersized+degraded, last acting [25,5,13,NONE]

-------------------------------
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------------------------------

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
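[Editor's aside: the two workable layouts Danny describes can be sketched as below. Profile and pool names here are illustrative, not taken from the thread, and both options are trade-offs rather than drop-in fixes.]

```shell
# Option 1: k=2, m=1 with the host failure domain. Three chunks fit
# on three hosts, but with size=3 and the default min_size of k+1=3
# the pool blocks writes while any host is down.
ceph osd erasure-code-profile set ec21-hdd \
    k=2 m=1 crush-failure-domain=host crush-device-class=hdd
ceph osd pool create cephfsHDD_data_ec21 1024 1024 erasure ec21-hdd

# Option 2: keep k=2, m=2 but spread chunks per OSD instead of per
# host. PGs then map cleanly onto 36 OSDs, but two chunks of one PG
# can land on the same host, so a host failure can make PGs unavailable.
ceph osd erasure-code-profile set ec22-osd \
    k=2 m=2 crush-failure-domain=osd crush-device-class=hdd
```

Note that an erasure code profile is effectively fixed for an existing pool; the usual route is to create a new pool with the desired profile and migrate the data into it.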
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

Danny Webb
Principal OpenStack Engineer
The Hut Group <http://www.thehutgroup.com/>
Email: Danny.Webb@xxxxxxxxxxxxxxx