Anthony, Darren,

Thanks for the responses. Answering your questions:

> What is the network you have for this cluster?

25GB/s

> Is this a chassis with universal slots, or is that NVMe device maybe M.2
> or rear-cage?

12 x HDD via LSI JBOD + 1 PCIe NVMe. It is 1.6TB now; for production the
plan is to use 3.2TB.

> `ceph df`
> `ceph osd dump | grep pool`
>
> So we can see what's going on HDD and what's on NVMe.

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    703 TiB  587 TiB  116 TiB   116 TiB      16.51
TOTAL  703 TiB  587 TiB  116 TiB   116 TiB      16.51

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
default.rgw.meta            52    64  6.0 KiB       13  131 KiB      0    177 TiB
.mgr                        54    32   28 MiB        8   83 MiB      0    177 TiB
.rgw.root                   55    64  2.0 KiB        4   48 KiB      0    177 TiB
default.rgw.control         56    64      0 B        8      0 B      0    177 TiB
default.rgw.buckets.index   59    32   34 MiB       33  102 MiB      0    177 TiB
default.rgw.log             63    32  3.6 KiB      209  408 KiB      0    177 TiB
default.rgw.buckets.non-ec  65    32   44 MiB       40  133 MiB      0    177 TiB
4_2_EC                      67  1024   71 TiB   18.61M  106 TiB  16.61    355 TiB

# ceph osd dump | grep pool
pool 52 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13123 flags hashpspool stripe_width 0 application rgw read_balance_score 5.27
pool 54 '.mgr' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 18206 lfor 0/0/13186 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 5.25
pool 55 '.rgw.root' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13191 flags hashpspool stripe_width 0 application rgw read_balance_score 3.92
pool 56 'default.rgw.control' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13200 flags hashpspool stripe_width 0 application rgw read_balance_score 6.55
pool 59 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18206 lfor 0/0/13594 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 5.27
pool 63 'default.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18397 lfor 0/0/18386 flags hashpspool stripe_width 0 application rgw read_balance_score 10.56
pool 65 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18923 lfor 0/0/18921 flags hashpspool stripe_width 0 application rgw read_balance_score 7.89
pool 67 '4_2_EC' erasure profile 4_2 size 6 min_size 5 crush_rule 13 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 23570 flags hashpspool stripe_width 16384 application rgw

> You also have the metadata pools used by RGW that ideally need to be on
> NVME.
> Because you are using EC then there is the buckets.non-ec pool which is
> used to manage the OMAPs for the multipart uploads; this is usually down at
> 8 PG's and that will be limiting things as well.

This part is very interesting. Some time ago I asked a similar question here
<https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YPHI5MF2CBQ2C7KYOJFG32A7N5HF2BXS/#7I6UDHWUK23JCKZSK25VXTRQYEDRFCPY>.
The conclusion was that the index is covered by BlueStore.
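For anyone who wants to double-check that on their own cluster, one way is to
look at the OSD metadata and the per-OSD OMAP usage. This is only a sketch
(OSD id 0 is just an example): bucket index entries are OMAP, OMAP lives in
RocksDB, and with a dedicated DB device RocksDB sits on that device unless it
spills over.

# does this OSD have a separate DB device? (look at the bluefs_* fields)
ceph osd metadata 0 | grep bluefs

# per-OSD OMAP and META usage, to see how much index/OMAP data there actually is
ceph osd df tree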
Should we consider removing a few HDDs and replacing them with SSDs for the
non-ec pool?

> So now you have the question of do you have enough streams running in
> parallel? Have you tried a benchmarking tool such as minio warp to see what
> it can achieve.

I think so. warp shows 1.6GiB/s for 20GB objects with 50 streams, which is
acceptable.
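If anyone wants to run a similar test, a warp invocation along these lines
should reproduce it. This is only a sketch: the endpoint, credentials and
bucket name are placeholders, and the flag names should be checked against
`warp put --help` for your warp version.

# 50 concurrent uploaders writing 20 GiB objects against one RGW endpoint
warp put \
  --host rgw.example.local:8080 \
  --access-key PLACEHOLDER_ACCESS_KEY \
  --secret-key PLACEHOLDER_SECRET_KEY \
  --bucket warp-bench \
  --obj.size 20GiB \
  --concurrent 50 \
  --duration 30m

Warp prints the aggregated PUT throughput at the end of the run.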
> Changing the bluestore_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn't be changing it as you are in untested
> territory.

ACK! :) Thanks!

On Mon, 27 May 2024 at 09:27, Darren Soothill <darren.soothill@xxxxxxxx> wrote:

> So a few questions I have around this.
>
> What is the network you have for this cluster?
>
> Changing the bluestore_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn't be changing it as you are in untested
> territory.
>
> The challenge with making these sorts of things perform is to generate lots
> of parallel streams, so whatever is doing the uploading needs to be doing
> parallel multipart uploads. There is no mention of the uploading code that
> is being used.
>
> So with 7 nodes each with 12 disks and doing large files like this I would
> be expecting to see 50-70MB/s per useable HDD. By useable I mean if you are
> doing replicas then you would divide the number of disks by the replica
> number, or in your case with EC I would be dividing the number of disks by
> the EC size and multiplying by the data part. So divide by 6 and multiply
> by 4.
>
> So allowing for EC overhead you in theory could get beyond 2.8GBytes/s.
> That is the theoretical disk limit I would be looking to exceed.
>
> So now you have the question of do you have enough streams running in
> parallel? Have you tried a benchmarking tool such as minio warp to see what
> it can achieve.
>
> You haven't mentioned the number of PG's you have for each of the pools in
> question. You need to ensure that every pool that is being used has more
> PG's than the number of disks. If that's not the case then individual disks
> could be slowing things down.
>
> You also have the metadata pools used by RGW that ideally need to be on
> NVME.
>
> Because you are using EC then there is the buckets.non-ec pool which is
> used to manage the OMAPs for the multipart uploads; this is usually down at
> 8 PG's and that will be limiting things as well.
>
>
> Darren Soothill
>
> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>
>
> On 25 May 2024, at 14:56, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > > Hi Everyone,
> > >
> > > I'm putting together a HDD cluster with an EC pool dedicated to the
> > > backup environment. Traffic via s3. Version 18.2, 7 OSD nodes,
> > > 12 * 12TB HDD + 1 NVMe each.
> >
> > QLC, man. QLC. That said, I hope you're going to use that single NVMe
> > SSD for at least the index pool. Is this a chassis with universal slots,
> > or is that NVMe device maybe M.2 or rear-cage?
> >
> > > Wondering if there is some general guidance for startup setup/tuning
> > > in regards to s3 object size.
> >
> > Small objects are the devil of any object storage system.
> >
> > > Files are read from fast storage (SSD/NVME) and written to s3. File
> > > sizes are 10MB-1TB, so it's not standard s3 traffic.
> >
> > Nothing nonstandard about that, though your 1TB objects presumably are
> > going to be MPU. Having the .buckets.non-ec pool on HDD with objects
> > that large might be really slow to assemble them; you might need to
> > increase timeouts, but I'm speculating.
> >
> > > Backup for big files took hours to complete.
> >
> > Spinners gotta spin. They're a false economy.
> >
> > > My first shot would be to increase the default
> > > bluestore_min_alloc_size_hdd to reduce the number of stored objects,
> > > but I'm not sure if it's a good direction?
> >
> > With that workload you *could* increase that to like 64KB, but I don't
> > think it'd gain you much.
> >
> > > Any other parameters worth checking to support such a traffic pattern?
> >
> > `ceph df`
> > `ceph osd dump | grep pool`
> >
> > So we can see what's going on HDD and what's on NVMe.
> >
> > > Thanks!
> > >
> > > --
> > > Łukasz
>

--
Łukasz Borek
lukasz@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx