Re: tuning for backup target cluster

Anthony, Darren,
Thanks for the response.

Answering your questions:

> What is the network you have for this cluster?

25Gb/s, so roughly 3 GB/s of theoretical bandwidth.

> Is this a chassis with universal slots, or is that NVMe device maybe M.2
> or rear-cage?

12 * HDD attached via an LSI JBOD HBA + 1 PCIe NVMe per node. The NVMe is
1.6 TB for now; the plan for production is to use 3.2 TB.


> `ceph df`
> `ceph osd dump | grep pool`
> So we can see what's going on HDD and what's on NVMe.


--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    703 TiB  587 TiB  116 TiB   116 TiB      16.51
TOTAL  703 TiB  587 TiB  116 TiB   116 TiB      16.51

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
default.rgw.meta            52    64  6.0 KiB       13  131 KiB      0    177 TiB
.mgr                        54    32   28 MiB        8   83 MiB      0    177 TiB
.rgw.root                   55    64  2.0 KiB        4   48 KiB      0    177 TiB
default.rgw.control         56    64      0 B        8      0 B      0    177 TiB
default.rgw.buckets.index   59    32   34 MiB       33  102 MiB      0    177 TiB
default.rgw.log             63    32  3.6 KiB      209  408 KiB      0    177 TiB
default.rgw.buckets.non-ec  65    32   44 MiB       40  133 MiB      0    177 TiB
4_2_EC                      67  1024   71 TiB   18.61M  106 TiB  16.61    355 TiB

# ceph osd dump | grep pool
pool 52 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13123 flags hashpspool stripe_width 0 application rgw read_balance_score 5.27
pool 54 '.mgr' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 18206 lfor 0/0/13186 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 5.25
pool 55 '.rgw.root' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13191 flags hashpspool stripe_width 0 application rgw read_balance_score 3.92
pool 56 'default.rgw.control' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode off last_change 18206 lfor 0/0/13200 flags hashpspool stripe_width 0 application rgw read_balance_score 6.55
pool 59 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18206 lfor 0/0/13594 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 5.27
pool 63 'default.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18397 lfor 0/0/18386 flags hashpspool stripe_width 0 application rgw read_balance_score 10.56
pool 65 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 18923 lfor 0/0/18921 flags hashpspool stripe_width 0 application rgw read_balance_score 7.89
pool 67 '4_2_EC' erasure profile 4_2 size 6 min_size 5 crush_rule 13 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 23570 flags hashpspool stripe_width 16384 application rgw
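Note that `ceph df` above shows only an hdd device class, so every pool,
metadata included, currently lives on HDD. For reference, the profile behind
pool 67 can be dumped with the command below; with 1024 PGs * 6 shards over
84 HDD OSDs that works out to roughly 73 PG shards per OSD, so the EC data
pool clears the "more PGs than disks" bar - it's the 32-PG metadata pools
that fall short.

    # show the erasure-code profile used by the 4_2_EC pool
    ceph osd erasure-code-profile get 4_2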


> You also have the metadata pools used by RGW that ideally need to be on
> NVMe.
> Because you are using EC there is also the buckets.non-ec pool, which is
> used to manage the OMAPs for multipart uploads; this is usually down at
> 8 PGs and that will be limiting things as well.

This part is very interesting. Some time ago I asked a similar question here
<https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YPHI5MF2CBQ2C7KYOJFG32A7N5HF2BXS/#7I6UDHWUK23JCKZSK25VXTRQYEDRFCPY>.
The conclusion was that the index is covered by BlueStore (DB/WAL on the
NVMe). Should we consider removing a few HDDs and replacing them with SSDs
for the non-ec pool, along the lines sketched below?
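Something like this is what I have in mind. A rough sketch only, not a
tested recipe: it assumes the new flash OSDs report the "ssd" device class
and that the rule name "rgw-meta-ssd" is unused; repointing a pool's
crush_rule will trigger backfill.

    # replicated rule targeting only ssd-class OSDs, failure domain = host
    ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
    # repoint the RGW metadata pools at the new rule
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-ssd
    ceph osd pool set default.rgw.buckets.non-ec crush_rule rgw-meta-ssd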

> So now you have the question of whether you have enough streams running in
> parallel. Have you tried a benchmarking tool such as MinIO warp to see what
> it can achieve?

I think so; warp shows 1.6 GiB/s for 20 GB objects across 50 concurrent
streams, which is acceptable.
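For the record, the invocation was along these lines (a sketch from memory;
the endpoint, credentials and bucket are placeholders, and the exact flag
spelling should be checked against your warp version):

    warp put --host=rgw.backup.local:8080 \
        --access-key=EXAMPLEKEY --secret-key=EXAMPLESECRET \
        --bucket=warp-bench --obj.size=20GiB --concurrent=50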

> Changing the bluestore_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn’t be changing it as you are in untested
> territory.

ACK! :)
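We'll leave it at the default then. For the record, the current value can be
read back from the config database with the command below; as far as I know
it only takes effect at OSD creation time, so changing it would mean
redeploying OSDs anyway.

    ceph config get osd bluestore_min_alloc_size_hdd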

Thanks!

On Mon, 27 May 2024 at 09:27, Darren Soothill <darren.soothill@xxxxxxxx>
wrote:

> So a few questions I have around this.
>
> What is the network you have for this cluster?
>
> Changing the bluestore_min_alloc_size would be the last thing I would even
> consider. In fact I wouldn’t be changing it as you are in untested
> territory.
>
> The challenge with making these sorts of things perform is to generate lots
> of parallel streams, so whatever is doing the uploading needs to be doing
> parallel multipart uploads. There is no mention of the uploading code that
> is being used.
>
> So with 7 nodes each with 12 disks, and doing large files like this, I
> would be expecting to see 50-70 MB/s per usable HDD. By usable I mean: if
> you are doing replicas then you would divide the number of disks by the
> replica number, or in your case with EC I would be dividing the number of
> disks by the EC size and multiplying by the data part. So divide by 6 and
> multiply by 4.
>
> So allowing for EC overhead you could in theory get beyond 2.8 GB/s.
> That is the theoretical disk limit I would be looking to exceed.
>
> So now you have the question of whether you have enough streams running in
> parallel. Have you tried a benchmarking tool such as MinIO warp to see what
> it can achieve?
>
> You haven’t mentioned the number of PGs you have for each of the pools in
> question. You need to ensure that every pool that is being used has more
> PGs than the number of disks. If that’s not the case then individual disks
> could be slowing things down.
>
> You also have the metadata pools used by RGW that ideally need to be on
> NVMe.
>
> Because you are using EC there is also the buckets.non-ec pool, which is
> used to manage the OMAPs for multipart uploads; this is usually down at
> 8 PGs and that will be limiting things as well.
>
>
>
> Darren Soothill
>
> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>
>
>
>
> On 25 May 2024, at 14:56, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
>
>
> Hi Everyone,
>
> I'm putting together an HDD cluster with an EC pool dedicated to the backup
> environment. Traffic is via S3. Version 18.2, 7 OSD nodes, 12 * 12 TB HDD +
> 1 NVMe each.
>
>
> QLC, man.  QLC.  That said, I hope you're going to use that single NVMe
> SSD for at least the index pool.  Is this a chassis with universal slots,
> or is that NVMe device maybe M.2 or rear-cage?
>
> Wondering if there is some general guidance for initial setup/tuning with
> regard to S3 object size.
>
>
> Small objects are the devil of any object storage system.
>
>
> Files are read from fast storage (SSD/NVMe) and
> written to S3. File sizes are 10 MB-1 TB, so it's not standard S3 traffic.
>
>
> Nothing nonstandard about that, though your 1 TB objects presumably are
> going to be MPUs. Having the .buckets.non-ec pool on HDD with objects that
> large might make them really slow to assemble; you might need to increase
> timeouts, but I'm speculating.
>
>
> Backups of big files took hours to complete.
>
>
> Spinners gotta spin.  They're a false economy.
>
> My first shot would be to increase the default bluestore_min_alloc_size_hdd
> to reduce the number of stored objects, but I'm not sure if it's a
> good direction?
>
>
> With that workload you *could* increase that to like 64KB, but I don't
> think it'd gain you much.
>
>
> Any other parameters worth checking to support such a
> traffic pattern?
>
>
> `ceph df`
> `ceph osd dump | grep pool`
>
> So we can see what's going on HDD and what's on NVMe.
>
>
> Thanks!
>
> --
> Łukasz
>
>
>

-- 
Łukasz Borek
lukasz@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



