>> Is this a chassis with universal slots, or is that NVMe device maybe M.2
>> or rear-cage?
>
> 12 * HDD via LSI JBOD + 1 PCIe NVMe. All NVMe devices are PCIe ;).
> Right now it's 1.6TB; the plan for production is to use 3.2TB.
>
>> `ceph df`
>> `ceph osd dump | grep pool`
>>
>> So we can see what's going on HDD and what's on NVMe.
>
> --- RAW STORAGE ---
> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd      703 TiB  587 TiB  116 TiB  116 TiB       16.51
> TOTAL    703 TiB  587 TiB  116 TiB  116 TiB       16.51
>
> --- POOLS ---
> POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> default.rgw.meta            52    64  6.0 KiB       13  131 KiB      0    177 TiB
> .mgr                        54    32  28 MiB         8  83 MiB       0    177 TiB
> .rgw.root                   55    64  2.0 KiB        4  48 KiB       0    177 TiB
> default.rgw.control         56    64  0 B            8  0 B          0    177 TiB
> default.rgw.buckets.index   59    32  34 MiB        33  102 MiB      0    177 TiB
> default.rgw.log             63    32  3.6 KiB      209  408 KiB      0    177 TiB
> default.rgw.buckets.non-ec  65    32  44 MiB        40  133 MiB      0    177 TiB
> 4_2_EC                      67  1024  71 TiB    18.61M  106 TiB  16.61    355 TiB

So *everything* is on the HDDs?

I suggest disabling the pg autoscaler and adjusting your pg_num values. If I
calculate correctly, `ceph osd df` should show +/- 82 PGs on each OSD; I would
target double that number. As a start, maybe raise buckets.index and
buckets.non-ec to 256.

>> You also have the metadata pools used by RGW that ideally need to be on
>> NVMe.
>>
>> Because you are using EC there is the buckets.non-ec pool, which is used
>> to manage the OMAPs for multipart uploads. This is usually down at 8 PGs
>> and that will be limiting things as well.
>
> This part is very interesting. Some time ago I asked a similar question here
> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YPHI5MF2CBQ2C7KYOJFG32A7N5HF2BXS/#7I6UDHWUK23JCKZSK25VXTRQYEDRFCPY>.
> The conclusion was that the index is covered by BlueStore (its OMAP data
> lives in the RocksDB on the DB device).

Hmmm. Since you don't have any pools on that single NVMe drive, I gather that
you're using it for WAL+DB, so you'll have ~132GB sliced off for each OSD?
Interesting idea. You'll want to watch the RocksDB usage on those OSDs to
ensure you aren't spilling onto the slow device.

> Should we consider removing a few HDDs and replacing them with SSDs for the
> non-ec pool?
>
>> So now you have the question of do you have enough streams running in
>> parallel? Have you tried a benchmarking tool such as MinIO warp to see
>> what it can achieve.
>
> I think so; warp shows 1.6GiB/s for 20GB objects in 50 streams - acceptable.
>
>> Changing the bluestore_min_alloc_size would be the last thing I would even
>> consider. In fact I wouldn't be changing it as you are in untested
>> territory.
>
> ACK! :)

It used to be 64KB. Back around Octopus / Pacific it changed to 4KB for both
rotational and non-rotational devices, in part due to the space amplification
that RGW users with small objects experienced:
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
(BlueStore Space Amplification Cheat Sheet)

In your case I don't think it would hurt to rebuild all of your OSDs with a
value of 64KB, but I don't think that with modern code it would buy you much.

> Thanks!
>
> On Mon, 27 May 2024 at 09:27, Darren Soothill <darren.soothill@xxxxxxxx> wrote:
>
>> So a few questions I have around this.
>>
>> What is the network you have for this cluster?
>>
>> Changing the bluestore_min_alloc_size would be the last thing I would even
>> consider. In fact I wouldn't be changing it as you are in untested
>> territory.
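
If you do decide to experiment with 64KB anyway, a rough sketch of how I'd
check and set it (osd.0 is just an example id, and if I remember right recent
releases also report the value each OSD was actually built with in its
metadata):

  # value that newly created OSDs would pick up
  ceph config get osd bluestore_min_alloc_size_hdd

  # value an existing OSD was built with (osd.0 as an example)
  ceph osd metadata 0 | grep -i min_alloc

  # only affects OSDs created afterwards; existing OSDs have to be redeployed
  ceph config set osd bluestore_min_alloc_size_hdd 65536
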
>>
>> The challenge with making this sort of thing perform is generating lots of
>> parallel streams, so whatever is doing the uploading needs to be doing
>> parallel multipart uploads. There is no mention of the uploading code that
>> is being used.
>>
>> So with 7 nodes each with 12 disks and doing large files like this I would
>> be expecting to see 50-70MB/s per usable HDD. By usable I mean that if you
>> are doing replicas you would divide the number of disks by the replica
>> count, or in your case with EC I would divide the number of disks by the
>> EC size and multiply by the data part. So divide by 6 and multiply by 4.
>>
>> So allowing for EC overhead you could in theory get beyond 2.8GBytes/s.
>> That is the theoretical disk limit I would be looking to exceed.
>>
>> So now you have the question of do you have enough streams running in
>> parallel? Have you tried a benchmarking tool such as MinIO warp to see
>> what it can achieve.
>>
>> You haven't mentioned the number of PGs you have for each of the pools in
>> question. You need to ensure that every pool that is being used has more
>> PGs than the number of disks. If that's not the case then individual disks
>> could be slowing things down.
>>
>> You also have the metadata pools used by RGW that ideally need to be on
>> NVMe.
>>
>> Because you are using EC there is the buckets.non-ec pool, which is used
>> to manage the OMAPs for multipart uploads. This is usually down at 8 PGs
>> and that will be limiting things as well.
>>
>>
>> Darren Soothill
>>
>> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>>
>>
>> On 25 May 2024, at 14:56, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>>
>>> Hi Everyone,
>>>
>>> I'm putting together an HDD cluster with an EC pool dedicated to the
>>> backup environment. Traffic via S3. Version 18.2, 7 OSD nodes,
>>> 12 * 12TB HDD + 1 NVMe each.
>>
>> QLC, man. QLC. That said, I hope you're going to use that single NVMe SSD
>> for at least the index pool. Is this a chassis with universal slots, or is
>> that NVMe device maybe M.2 or rear-cage?
>>
>>> Wondering if there is some general guidance for startup setup/tuning in
>>> regards to S3 object size.
>>
>> Small objects are the devil of any object storage system.
>>
>>> Files are read from fast storage (SSD/NVMe) and written to S3. File sizes
>>> are 10MB-1TB, so it's not standard S3 traffic.
>>
>> Nothing nonstandard about that, though your 1TB objects presumably are
>> going to be MPU. Having the .buckets.non-ec pool on HDD with objects that
>> large might make them really slow to assemble; you might need to increase
>> timeouts, but I'm speculating.
>>
>>> Backup for big files took hours to complete.
>>
>> Spinners gotta spin. They're a false economy.
>>
>>> My first shot would be to increase the default bluestore_min_alloc_size_hdd
>>> to reduce the number of stored objects, but I'm not sure if it's a good
>>> direction?
>>
>> With that workload you *could* increase that to like 64KB, but I don't
>> think it'd gain you much.
>>
>>> Any other parameters worth checking to support such a traffic pattern?
>>
>> `ceph df`
>> `ceph osd dump | grep pool`
>>
>> So we can see what's going on HDD and what's on NVMe.
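
To make the earlier pg_num / placement suggestions concrete, this is roughly
what I have in mind - a sketch, not gospel. The 256 figure is the starting
point I mentioned above, the rule name is made up, and I'm assuming the
default CRUSH root plus an "ssd" device class on whatever flash OSDs you add:

  # stop the autoscaler from fighting you on these pools
  ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
  ceph osd pool set default.rgw.buckets.non-ec pg_autoscale_mode off

  # raise pg_num as a starting point
  ceph osd pool set default.rgw.buckets.index pg_num 256
  ceph osd pool set default.rgw.buckets.non-ec pg_num 256

  # once you have flash OSDs, pin the RGW metadata pools to them
  ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
  ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-ssd
  ceph osd pool set default.rgw.buckets.non-ec crush_rule rgw-meta-ssd

  # and keep an eye out for DB/WAL spillover onto the HDDs
  ceph health detail | grep -i spillover
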
>>> Thanks!
>>>
>>> --
>>> Łukasz
>>
>
> --
> Łukasz Borek
> lukasz@xxxxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx