Danny, Levin,

Thanks, both your answers helped (and are exactly what I suspected was the case). Looking back at the documentation I can see where my confusion began, as it isn't clear there that the "simplest" and "default" erasure code profiles are different. I'll report a documentation bug in the hope that they clarify things (I know of at least one other admin who hit the same issue I'm seeing, so I'm not the only one...).

Cheers,
Mark

From: Danny Webb <Danny.Webb@xxxxxxxxxxxxxxx>
Sent: 25 July 2022 14:32
To: Mark S. Holliman <msh@xxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: Default erasure code profile not working for 3 node cluster?

The only thing I can see from your setup is that you've not set a failure domain in your crush rule, so it defaults to host. A 2/2 erasure code won't work in that scenario, as each chunk of the EC stripe must be in its own failure domain. If you wanted it to work with that setup you'd need to change the crush failure domain to OSD rather than host (but then you'd not have the ability to lose a host). If you wanted to use a failure domain of host, you'd need to set your k/m values to 2/1, and even with that you'd still not be able to lose a host and have a writable cluster.

________________________________
From: Mark S. Holliman <msh@xxxxxxxxx>
Sent: 25 July 2022 14:13
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Default erasure code profile not working for 3 node cluster?

Dear All,

I've recently set up a 3 node Ceph Quincy (17.2) cluster to serve a pair of CephFS mounts for a Slurm cluster. Each ceph node has 6 x SSD and 6 x HDD, and I've set up the pools and crush rules to create separate CephFS filesystems using the different disk classes. I used the default erasure-code-profile to create the pools (see details below), as the documentation states that it works on a 3 node cluster.
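[Editor's aside: the arithmetic behind Danny's reply above can be checked directly on the cluster. This is a sketch using standard ceph CLI commands; `dataHDD` is the crush rule named later in this thread, and the default profile's k=2/m=2 values are quoted from the pool details below.]

```shell
# Confirm which erasure code profile the pool is using and what the
# crush rule's failure domain ("type" in the chooseleaf step) is.
ceph osd erasure-code-profile get default   # this cluster reports k=2, m=2
ceph osd crush rule dump dataHDD

# With k=2, m=2 a PG needs k+m = 4 chunks, each placed in a distinct
# failure domain. With the default failure domain of "host" and only
# 3 hosts, the fourth chunk has nowhere to go -- hence the "NONE"
# slot in the acting sets shown further down.
```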
The system looked healthy after the initial setup, but now, a few weeks in, I'm seeing signs of problems: a growing count of pgs not deep-scrubbed in time, significant numbers of pgs in "active+undersized"/"active+undersized+degraded", most pgs in an "active+clean+remapped" state, and no recovery activity. I looked at some of the pgs in the stuck states and noticed that they all list a "NONE" OSD in their 'last acting' list, which points to this issue: https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean

That is likely what is causing the pgs to get stuck in a degraded state and the ever-growing list of late deep scrubs. But I'm confused why the documentation states that the default erasure code profile should work on a 3 node cluster: https://docs.ceph.com/en/latest/rados/operations/erasure-code/#creating-a-sample-erasure-coded-pool

Is this documentation in error? Or is there something else going on with my setup? What is an ideal erasure code profile for a 3 node system?

Cheers,
Mark

### Commands used to create the CephFS filesystem ###
ceph osd pool create cephfsHDD_data 1024 1024 erasure
ceph osd pool create cephfsHDD_metadata 64 64
ceph osd erasure-code-profile set dataHDD crush-device-class=hdd
ceph osd crush rule create-erasure dataHDD dataHDD
ceph osd pool set cephfsHDD_data crush_rule dataHDD
ceph osd pool set cephfsHDD_data allow_ec_overwrites true
ceph fs new cephfsHDD cephfsHDD_metadata cephfsHDD_data --force

### Example Status
  health: HEALTH_WARN
          Degraded data redundancy: 750/10450 objects degraded (7.177%), 313 pgs degraded, 775 pgs undersized
          887 pgs not deep-scrubbed in time
          887 pgs not scrubbed in time
  services:
    mon: 3 daemons...
    mgr: ...
    mds: 2/2 daemons up, 2 standby
    osd: 36 osds: 36 up (since 27h), 36 in (since 5w); 1272 remapped pgs
  data:
    volumes: 2/2 healthy
    pools:   5 pools, 2176 pgs
    objects: 2.82k objects, 361 MiB
    usage:   40 GiB used, 262 TiB / 262 TiB avail
    pgs:     750/10450 objects degraded (7.177%)
             1240/10450 objects misplaced (11.866%)
             1272 active+clean+remapped
             462  active+undersized
             313  active+undersized+degraded
             129  active+clean

### Erasure Code Profile
k=2
m=2
plugin=jerasure
technique=reed_sol_van

### Pool details
root@dokkalfar01:~# ceph osd pool get cephfsHDD_data all
size: 4
min_size: 3
pg_num: 1023
pgp_num: 972
crush_rule: dataHDD
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: default
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false

### Example health details of unhappy pgs
pg 3.282 is stuck undersized for 27h, current state active+undersized+degraded, last acting [29,15,5,NONE]
pg 3.285 is stuck undersized for 27h, current state active+undersized+degraded, last acting [0,17,28,NONE]
pg 3.286 is stuck undersized for 27h, current state active+undersized+degraded, last acting [3,17,26,NONE]
pg 3.288 is stuck undersized for 27h, current state active+undersized+degraded, last acting [13,NONE,0,24]
pg 3.28e is stuck undersized for 27h, current state active+undersized+degraded, last acting [28,NONE,5,14]
pg 3.297 is stuck undersized for 27h, current state active+undersized+degraded, last acting [25,5,13,NONE]

-------------------------------
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------------------------------

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
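[Editor's aside: the two workable layouts Danny describes can be sketched as below. Profile and pool names here are illustrative, not taken from the thread, and both options are trade-offs rather than drop-in fixes.]

```shell
# Option 1: k=2, m=1 with the host failure domain. Three chunks fit
# on three hosts, but with size=3 and the default min_size of k+1=3
# the pool blocks writes while any host is down.
ceph osd erasure-code-profile set ec21-hdd \
    k=2 m=1 crush-failure-domain=host crush-device-class=hdd
ceph osd pool create cephfsHDD_data_ec21 1024 1024 erasure ec21-hdd

# Option 2: keep k=2, m=2 but spread chunks per OSD instead of per
# host. PGs then map cleanly onto 36 OSDs, but two chunks of one PG
# can land on the same host, so a host failure can make PGs unavailable.
ceph osd erasure-code-profile set ec22-osd \
    k=2 m=2 crush-failure-domain=osd crush-device-class=hdd
```

Note that an erasure code profile is effectively fixed for an existing pool; the usual route is to create a new pool with the desired profile and migrate the data into it.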
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

Danny Webb
Principal OpenStack Engineer
The Hut Group <http://www.thehutgroup.com/>
Email: Danny.Webb@xxxxxxxxxxxxxxx