> On Mar 2, 2024, at 10:37 AM, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
>
> Hi Y'all,
>
> We have a new ceph cluster online that looks like this:
>
> md-01 : monitor, manager, mds
> md-02 : monitor, manager, mds
> md-03 : monitor, manager
> store-01 : twenty 30TB NVMe OSDs
> store-02 : twenty 30TB NVMe OSDs
>
> The cephfs storage is using erasure coding at 4:2. The crush domain is set to "osd".
>
> (I know that's not optimal but let me get to that in a minute)
>
> We have a current regular single NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks). We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03". When we do that, we would then have three OSD servers. We would then switch the crush domain to "host".
>
> My question is this: Given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03, such that if a single OSD server went down, the other two would be enough to keep the system online? Like, with 4:2 erasure coding, would 2 shards go on store-01, then 2 shards on store-02, and then 2 shards on store-03? Is that how I understand it?

Nope. If the failure domain is *host*, without a carefully crafted special CRUSH rule, CRUSH will want to spread the 6 shards over 6 failure domains, and you will only have 3. I don't remember for sure whether the PGs would be stuck remapped or stuck unable to activate, but either way you would have a very bad day.

Say you craft a CRUSH rule that places two shards on each host (rough sketch at the end of this message). One host goes down, and you have at most K shards up. IIRC the PGs will be `inactive`, but you won't lose existing data.

Here are multiple reasons why, for small deployments, I favor 1U servers:

* Having enough servers so that if one is down, service can proceed
* Having enough failure domains to do EC, or at least replication, safely
* Is your networking a bottleneck?

Assuming these are QLC SSDs, do you have their min_alloc_size set to match the IU? Ideally you would mix in a couple of TLC OSDs on each server, in this case including the control plane, for the CephFS metadata pool.

I'm curious which SSDs you're using; please write me privately as I have history with QLC.
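
For the two-shards-per-host idea, the sketch below shows the usual shape of such a rule and how it gets injected. Treat it as a rough example only: the rule name, rule id, and pool name are placeholders you would replace with your own, and you should test it with crushtool before setting it live.

# Decompile the current CRUSH map so the rule can be added by hand
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# Add something along these lines to crush.txt (name and id are placeholders):
#
# rule ec42_two_per_host {
#         id 99
#         type erasure
#         step set_chooseleaf_tries 5
#         step set_choose_tries 100
#         step take default
#         step choose indep 3 type host      # pick 3 hosts...
#         step chooseleaf indep 2 type osd   # ...and 2 OSDs (shards) on each
#         step emit
# }

# Recompile, inject, and point the EC data pool at the new rule
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph osd pool set cephfs_data_ec crush_rule ec42_two_per_host   # pool name is a placeholder

Remember this only protects existing data; with one of three hosts down you have exactly K shards, so PGs can't serve I/O until the host returns.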
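
And since min_alloc_size came up: bluestore_min_alloc_size is baked in when an OSD is created, so it has to be set before you deploy the OSDs on store-03 (existing OSDs would need to be redeployed to change it). A sketch, with 16 KiB standing in for whatever the drives' actual IU is:

# Set before creating the OSDs; it cannot be changed on an existing OSD
# without redeploying it. 16384 is only an example value; use the drive's IU.
ceph config set osd bluestore_min_alloc_size_ssd 16384
ceph config get osd bluestore_min_alloc_size_ssd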