On 20/11/2023 at 09:24:41+0000, Frank Schilder wrote:

Hi,

Thanks everyone for your answer.

>
> we are using something similar for ceph-fs. For a backup system your setup can work, depending on how you back up. While HDD pools have poor IOP/s performance, they are very good for streaming workloads. If you are using something like Borg backup that writes huge files sequentially, a HDD back-end should be OK.
>

Ok. Good to know.

> Here are some things to consider and try out:
>
> 1. You really need to get a bunch of enterprise SSDs with power loss protection for the FS meta data pool (disable write cache if enabled, this will disable volatile write cache and switch to protected caching). We are using (formerly Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise performance. Place the meta-data pool and the primary data pool on these disks. Create a secondary data pool on the HDDs and assign it to the root *before* creating anything on the FS (see the recommended 3-pool layout for ceph file systems in the docs). I would not even consider running this without SSDs. 1 such SSD per host is the minimum, 2 is better. If Borg or whatever can make use of a small fast storage directory, assign a sub-dir of the root to the primary data pool.

OK. I will see what I can do.

>
> 2. Calculate with sufficient extra disk space. As long as utilization stays below 60-70% bluestore will try to make large object writes sequential, which is really important for HDDs. On our cluster we currently have 40% utilization and I get full HDD bandwidth out for large sequential reads/writes. Make sure your backup application makes large sequential IO requests.
>
> 3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty.
> The latest generation Intel Xeon Scalable Processors are so efficient with ceph that 1 HT per HDD is more than enough.

Yes, I get 512G on each node, 64 cores on each server.

>
> 4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 2 nodes. Of course, you can use them as additional MON+MGR nodes. We also use 5 and it improves maintainability a lot.
>

Ok thanks.

> Something more exotic if you have time:
>
> 5. To improve sequential performance further, you can experiment with larger min_alloc_sizes for OSDs (on creation time, you will need to scrap and re-deploy the cluster to test different values). Every HDD has a preferred IO-size for which random IO achieves nearly the same bandwidth as sequential writes. (But see 7.)
>
> 6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is the preferred IO size (usually between 256K-1M). (But see 7.)
>
> 7. Important: large min_alloc_sizes are only good if your workload *never* modifies files, but only replaces them. A bit like a pool without EC overwrite enabled. The implementation of EC overwrites has a "feature" that can lead to massive allocation amplification. If your backup workload does modifications to files instead of adding new+deleting old, do *not* experiment with options 5.-7. Instead, use the default and make sure you have sufficient unused capacity to increase the chances for large bluestore writes (keep utilization below 60-70% and just buy extra disks). A workload with large min_alloc_sizes has to be S3-like, only upload, download and delete are allowed.

Thanks a lot for those tips.

I'm a newbie with ceph so it's going to take some time before I understand everything you say.

Best regards
--
Albert SHIH 🦫 🐸
France
Local time: Thu. 23 Nov. 2023 08:32:20 CET
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
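
For anyone following along, the arithmetic behind points 2, 6 and 7 above can be sketched roughly as below. The 4M object size, the 4+2 EC profile and the 60-70% utilization target come from the thread; the function names and the raw capacity figure are made up for illustration and are not part of any ceph tool.

```python
# Rough sizing sketch; function names and the raw capacity are hypothetical.
MiB = 1024 * 1024


def ec_chunk_size(object_size: int, k: int) -> int:
    """Largest data chunk one OSD receives for a full object in a k+m EC pool."""
    return object_size // k


def usable_capacity(raw: float, k: int, m: int, max_utilization: float) -> float:
    """Amount of user data that fits while raw utilization stays at or
    below max_utilization (same unit as raw)."""
    return raw * max_utilization * k / (k + m)


if __name__ == "__main__":
    # Point 6: 4M objects on a 4+2 pool give 1M chunks per OSD.
    chunk = ec_chunk_size(4 * MiB, k=4)
    print(f"chunk size per OSD: {chunk // 1024} KiB")

    # Points 2 and 7: plan capacity so utilization stays below 60-70%.
    raw_tb = 1000.0  # assumed raw HDD capacity in TB, for illustration only
    for target in (0.6, 0.7):
        data_tb = usable_capacity(raw_tb, k=4, m=2, max_utilization=target)
        print(f"at {target:.0%} utilization: about {data_tb:.0f} TB of backup data")
```

The chunk size is just the object size divided by the number of data chunks k, and the usable figure accounts for both the EC overhead (k+m)/k and the headroom needed to stay under the utilization target.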