Re: Question about erasure coding on cephfs

> On Mar 2, 2024, at 10:37 AM, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
> 
> Hi Y'all,
> 
> We have a new ceph cluster online that looks like this:
> 
> md-01 : monitor, manager, mds
> md-02 : monitor, manager, mds
> md-03 : monitor, manager
> store-01 : twenty 30TB NVMe OSDs
> store-02 : twenty 30TB NVMe OSDs
> 
> The cephfs storage is using erasure coding at 4:2.  The crush domain is set to "osd".
> 
> (I know that's not optimal but let me get to that in a minute)
> 
> We have a current regular single NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks).  We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03".  When we do that, we would then have three OSD servers.  We would then switch the crush domain to "host".
> 
> My question is this:  Given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03 such that if a single OSD server went down, the other two would be enough to keep the system online?  Like, with 4:2 erasure coding, would 2 shards go on store-01, then 2 shards on store-02, and then 2 shards on store-03?  Is that how I understand it?

Nope.  If the failure domain is *host*, then without a carefully crafted special CRUSH rule, CRUSH will want to spread the six shards (K=4 data + M=2 coding) across six failure domains, and you will only have three.  I don’t remember for sure whether the PGs would be stuck remapped or stuck unable to activate, but either way you would have a very bad day.
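
You can see this for yourself by dumping the CRUSH map and asking crushtool whether the pool’s rule can actually find six independent hosts.  A quick sketch; the rule ID and filename below are placeholders, so substitute the rule from `ceph osd pool get <pool> crush_rule`:

    # Extract the binary CRUSH map from the cluster.
    ceph osd getcrushmap -o crushmap.bin
    # Simulate placing 6 shards with the EC rule; with only three hosts,
    # every mapping comes up short and is reported as bad.
    crushtool -i crushmap.bin --test --rule 1 --num-rep 6 --show-bad-mappings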

Say you craft a CRUSH rule that places two shards on each host.  One host goes down, and you have at most K shards up; since an EC pool’s min_size defaults to K+1, those PGs will go `inactive`, but you won’t lose existing data.
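
For reference, a minimal sketch of what such a rule might look like in a decompiled CRUSH map (the rule name, id, and root are assumptions; compile and inject it with crushtool and `ceph osd setcrushmap`):

    rule ec42_two_per_host {
        id 3
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        # Pick 3 hosts, then 2 OSDs inside each one: 3 x 2 = 6 shards.
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
    }

The trade-off is as described above: lose one host and each PG is down to K shards, so it stops serving I/O until that host comes back.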

Here are several reasons why I favor 1U servers for small deployments:

* Having enough servers so that if one is down, service can proceed
* Having enough failure domains to do EC — or at least replication — safely
* Spreading traffic across more nodes, so networking on any one of them is less likely to become a bottleneck


Assuming these are QLC SSDs, do you have the OSDs’ min_alloc_size set to match the drives’ IU (indirection unit)?  Ideally you would also mix a couple of TLC OSDs into each server (including the control-plane nodes, in this case) for the CephFS metadata pool.
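
As a sketch only (the 16 KiB IU, OSD IDs, device-class name, and pool name below are all assumptions for illustration); note that min_alloc_size is baked in when an OSD is created, so it has to be set before the OSDs are built:

    # Match BlueStore's allocation size to a hypothetical 16 KiB IU
    # *before* creating the OSDs; existing OSDs must be redeployed to change it.
    ceph config set osd bluestore_min_alloc_size_ssd 16384
    # Verify what an existing OSD was actually built with:
    ceph osd metadata 0 | grep min_alloc_size

    # Steer the CephFS metadata pool onto the TLC drives via a custom
    # device class (osd.40/osd.41 and the pool name are placeholders):
    ceph osd crush rm-device-class osd.40 osd.41
    ceph osd crush set-device-class tlc osd.40 osd.41
    ceph osd crush rule create-replicated meta_on_tlc default host tlc
    ceph osd pool set cephfs_metadata crush_rule meta_on_tlc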

I’m curious which SSDs you’re using; please write to me privately, as I have history with QLC.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



