Re: tuning for backup target cluster

> 
>> Is this a chassis with universal slots, or is that NVMe device maybe M.2
>> or rear-cage?
> 
> 12 * HDD via LSI jbod + 1 PCI NVME.

All NVMe devices are PCIe ;).

> Now it's 1.6TB; for production the plan
> is to use 3.2TB.
> 
> 
>> `ceph df`
>> `ceph osd dump | grep pool`
>> So we can see what's going on HDD and what's on NVMe.
> 
> 
> --- RAW STORAGE ---
> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd    703 TiB  587 TiB  116 TiB   116 TiB      16.51
> TOTAL  703 TiB  587 TiB  116 TiB   116 TiB      16.51
> 
> --- POOLS ---
> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> default.rgw.meta            52    64  6.0 KiB       13  131 KiB      0 177 TiB
> .mgr                        54    32   28 MiB        8   83 MiB      0 177 TiB
> .rgw.root                   55    64  2.0 KiB        4   48 KiB      0 177 TiB
> default.rgw.control         56    64      0 B        8      0 B      0 177 TiB
> default.rgw.buckets.index   59    32   34 MiB       33  102 MiB      0 177 TiB
> default.rgw.log             63    32  3.6 KiB      209  408 KiB      0 177 TiB
> default.rgw.buckets.non-ec  65    32   44 MiB       40  133 MiB      0 177 TiB
> 4_2_EC                      67  1024   71 TiB   18.61M  106 TiB  16.61 355 TiB

So *everything* is on the HDDs?

I suggest disabling the pg autoscaler and adjusting your pg_num values.  If I calculate correctly, `ceph osd df` should show roughly 82 PGs on each OSD.  I would target double that number.
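
Back-of-the-envelope for that figure (hedged: assuming 84 OSDs = 7 nodes x 12 HDDs, and 3x replication on the small pools; the exact number depends on replica counts):

    # EC pool:        1024 PGs x 6 shards (4+2)    = 6144 PG replicas
    # replicated x3:  (64+32+64+64+32+32+32) x 3   =  960 PG replicas
    # total:          7104 / 84 OSDs              ~=   85 per OSD
    ceph osd df    # compare against the PGS column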

As a start, maybe raise buckets.index and buckets.non-ec to 256 PGs each.
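
Something like this, as a sketch -- pool names per your `ceph df` output above, with the autoscaler disabled per-pool so it doesn't fight the change:

    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.non-ec pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.index pg_num 256
    ceph osd pool set default.rgw.buckets.non-ec pg_num 256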

> 
>> You also have the metadata pools used by RGW that ideally need to be on
>> NVMe.
>> Because you are using EC, there is the buckets.non-ec pool, which is used
>> to manage the OMAPs for multipart uploads. This is usually down at 8 PGs,
>> and that will be limiting things as well.
> 
> This part is very interesting. Some time ago I asked a similar question here
> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YPHI5MF2CBQ2C7KYOJFG32A7N5HF2BXS/#7I6UDHWUK23JCKZSK25VXTRQYEDRFCPY>.
> The conclusion was that the index is covered by BlueStore.

Hmmm.  Since you don’t have any pools on that single NVMe drive, I gather that you’re using it for WAL+DB, and you’ll have ~132GB sliced per OSD?  Interesting idea.  You’ll want to watch the RocksDB usage on those OSDs to ensure you aren’t spilling onto the slow device.
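
One way to watch for that (a sketch; exact output varies by release):

    # spillover surfaces as a BLUESTORE_SPILLOVER health warning
    ceph health detail | grep -i spillover
    # per-OSD BlueFS breakdown, DB device vs slow device (osd.0 as an example)
    ceph tell osd.0 bluefs stats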

> Should we consider
> removing a few HDDs and replacing them with SSDs for the non-ec pool?
> 
>> So now you have the question of whether you have enough streams running in
>> parallel. Have you tried a benchmarking tool such as minio warp to see what
>> it can achieve?
> 
> I think so; warp shows 1.6GiB/s for 20GB objects in 50 streams, which is
> acceptable.
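
For reference, a warp run along those lines might look like this (endpoint and credentials are placeholders):

    warp put --host=rgw.example.net:8080 --access-key=KEY --secret-key=SECRET \
        --obj.size=20GiB --concurrent=50 --duration=10m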
> 
>> Changing the bluestore_min_alloc_size would be the last thing I would even
>> consider. In fact I wouldn’t be changing it, as you are in untested
>> territory.
> 
> ACK! :)

It used to be 64KB.  Back around Octopus / Pacific it changed to 4KB for both rotational and non-rotational devices, in part due to the space amp that RGW users with small objects experienced:

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
(Bluestore Space Amplification Cheat Sheet)


In your case I don’t think it would hurt to rebuild all of your OSDs with a value of 64KB, but with modern code I don’t think it would buy you much.
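
If you did go down that path, a sketch -- note the value is baked in at OSD creation, so every OSD would have to be destroyed and redeployed for it to take effect:

    ceph config set osd bluestore_min_alloc_size_hdd 65536
    # then recreate each OSD in turn; the new value only applies at mkfs time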

> 
> Thanks!
> 
> On Mon, 27 May 2024 at 09:27, Darren Soothill <darren.soothill@xxxxxxxx>
> wrote:
> 
>> So a few questions I have around this.
>> 
>> What is the network you have for this cluster?
>> 
>> Changing the bluestore_min_alloc_size would be the last thing I would even
>> consider. In fact I wouldn’t be changing it, as you are in untested
>> territory.
>> 
>> The challenge with making these sorts of things perform is to generate lots
>> of parallel streams, so whatever is doing the uploading needs to be doing
>> parallel multipart uploads. There is no mention of the uploading code that
>> is being used.
>> 
>> So with 7 nodes, each with 12 disks, and doing large files like this, I
>> would expect to see 50-70MB/s per usable HDD. By usable I mean that if you
>> are doing replicas you would divide the number of disks by the replica
>> count, or in your case with EC I would divide the number of disks by the
>> EC size and multiply by the data part. So divide by 6 and multiply by 4.
>> 
>> So allowing for EC overhead, you could in theory get beyond 2.8GBytes/s.
>> That is the theoretical disk limit I would be looking to exceed.
>> 
>> So now you have the question of whether you have enough streams running in
>> parallel. Have you tried a benchmarking tool such as minio warp to see what
>> it can achieve?
>> 
>> You haven’t mentioned the number of PGs you have for each of the pools in
>> question. You need to ensure that every pool that is being used has more
>> PGs than the number of disks. If that’s not the case then individual disks
>> could be slowing things down.
>> 
>> You also have the metadata pools used by RGW that ideally need to be on
>> NVMe.
>> 
>> Because you are using EC, there is the buckets.non-ec pool, which is used
>> to manage the OMAPs for multipart uploads. This is usually down at 8 PGs,
>> and that will be limiting things as well.
>> 
>> 
>> 
>> Darren Soothill
>> 
>> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>> 
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>> 
>> 
>> 
>> 
>> On 25 May 2024, at 14:56, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>> 
>> 
>> 
>> Hi Everyone,
>> 
>> I'm putting together a HDD cluster with an EC pool dedicated to the backup
>> environment. Traffic via s3. Version 18.2, 7 OSD nodes, 12 * 12TB HDD +
>> 1 NVMe each.
>> 
>> 
>> QLC, man.  QLC.  That said, I hope you're going to use that single NVMe
>> SSD for at least the index pool.  Is this a chassis with universal slots,
>> or is that NVMe device maybe M.2 or rear-cage?
>> 
>> Wondering if there is some general guidance for initial setup/tuning with
>> regard to s3 object size.
>> 
>> 
>> Small objects are the devil of any object storage system.
>> 
>> 
>> Files are read from fast storage (SSD/NVMe) and
>> written to s3. File sizes are 10MB-1TB, so it's not standard s3 traffic.
>> 
>> 
>> Nothing nonstandard about that, though your 1TB objects presumably are
>> going to be MPUs.  With objects that large, having the .buckets.non-ec pool
>> on HDD might make them really slow to assemble; you might need to increase
>> timeouts, but I'm speculating.
>> 
>> 
>> Backup for big files took hours to complete.
>> 
>> 
>> Spinners gotta spin.  They're a false economy.
>> 
>> My first shot would be to increase the default bluestore_min_alloc_size_hdd
>> to reduce the number of stored objects, but I'm not sure if it's a
>> good direction?
>> 
>> 
>> With that workload you *could* increase that to like 64KB, but I don't
>> think it'd gain you much.
>> 
>> 
>> Any other parameters worth checking to support such a
>> traffic pattern?
>> 
>> 
>> `ceph df`
>> `ceph osd dump | grep pool`
>> 
>> So we can see what's going on HDD and what's on NVMe.
>> 
>> 
>> Thanks!
>> 
>> --
>> Łukasz
>> 
>> 
>> 
> 
> -- 
> Łukasz Borek
> lukasz@xxxxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



