Re: Question about erasure coding on cephfs

Hi Erich,

Regarding a similar problem I asked about some months ago, Frank Schilder published this on the list (December 6, 2023); it may be helpful for your setup. I have not tested it yet, as my cluster is still being deployed.

       To provide some first-hand experience: I was operating a pool with a 6+2 EC profile on 4 hosts for a while (until we got more hosts), and the "subdivide a physical host into 2 crush buckets" approach actually worked best (I basically tried all the approaches described in the linked post and they all had pitfalls).

       Procedure is more or less:

       - add a second (logical) host bucket for each physical host by suffixing the host name with "-B" (ceph osd crush add-bucket <name> <type> <location>)
       - move half of the OSDs on each host into this new host bucket (ceph osd crush move osd.ID host=HOSTNAME-B)
       - make this location persist across OSD restarts (ceph config set osd.ID crush_location "host=HOSTNAME-B"); see the worked example below
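
       With made-up names (physical host "store-01", osd.0 as one of its OSDs, default CRUSH root), the worked example would look roughly like this; adapt names and IDs to your own cluster:

            # create a second logical host bucket for store-01
            ceph osd crush add-bucket store-01-B host
            # place the new bucket under the CRUSH root in use (here: default)
            ceph osd crush move store-01-B root=default
            # move one of store-01's OSDs into the new bucket
            ceph osd crush move osd.0 host=store-01-B
            # persist the placement across OSD restarts
            ceph config set osd.0 crush_location "host=store-01-B"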

       This will allow you to move OSDs back easily when you get more hosts and can afford the recommended 1 shard per host. It will also show which OSDs were moved and where with a simple "ceph config dump | grep crush_location". Best of all, you don't have to fiddle around with crush maps and hope they do what you want. Just use failure domain host and you are good. No more than 2 host buckets per physical host means no more than 2 shards per physical host with default placement rules.
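
       For reference, this is roughly how such a pool would be created with the default host failure domain; the profile and pool names below are only placeholders:

            # 6+2 EC profile with failure domain "host"
            ceph osd erasure-code-profile set ec-6-2 k=6 m=2 crush-failure-domain=host
            ceph osd pool create cephfs_data_ec erasure ec-6-2
            # needed when the EC pool is used as a CephFS data pool
            ceph osd pool set cephfs_data_ec allow_ec_overwrites true
            # verify the bucket layout and where the moved OSDs ended up
            ceph osd tree
            ceph config dump | grep crush_location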

       I was operating this set-up with min_size=6 and feeling bad about it due to the reduced maintainability (risk of data loss during maintenance). It's not great, really, but sometimes there is no way around it. I was happy when I got the extra hosts.
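
       If you end up in the same situation, note that min_size is a per-pool setting (pool name below is again a placeholder). With k=6/m=2 the default is 7 (k+1); lowering it to 6 keeps the pool writable with two shards down, at the cost of that safety margin:

            ceph osd pool set cephfs_data_ec min_size 6
            ceph osd pool get cephfs_data_ec min_size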

Patrick

On 02/03/2024 at 16:37, Erich Weiler wrote:
Hi Y'all,

We have a new ceph cluster online that looks like this:

md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs

The cephfs storage is using erasure coding at 4:2.  The CRUSH failure domain is set to "osd".

(I know that's not optimal but let me get to that in a minute)

We currently have a single regular NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks).  We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03".  That would give us three OSD servers, at which point we would switch the CRUSH failure domain to "host".

My question is this:  given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03, such that if a single OSD server went down, the other two would be enough to keep the system online?  That is, with 4:2 erasure coding, would 2 shards go on store-01, 2 shards on store-02, and 2 shards on store-03?  Is that how it works?

Thanks for any insight!

-erich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
