Re: Ceph RGW performance guidelines

> On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkominos@xxxxxxxxx> wrote:
> 
> Hello Anthony and thank you for your response!
> 
> I have placed the requested info in a separate gist here:
> https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885

> 3826 pgs not deep-scrubbed in time
> 1501 pgs not scrubbed in time

Not surprising for HDDs.  Double your deep-scrub interval.
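
For reference, the default osd_deep_scrub_interval is one week (604800 seconds), so doubling it would look something like the sketch below, assuming you're on a release with the centralized config database; osd.0 is just an example ID:

    ceph config set osd osd_deep_scrub_interval 1209600   # 2 weeks, in seconds
    ceph config get osd.0 osd_deep_scrub_interval          # spot-check that one OSD picked it up

The "not deep-scrubbed in time" warning threshold is derived from that interval, so the warnings should taper off once the OSDs can keep up.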

> Every OSD is an HDD, with their corresponding index, on a partition in an
> SSD device.


So you’re relying on the SSD DB device for the index pool?  Have you looked at your logs / metrics for those OSDs to see if there is any spillover?
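
If it helps, here's a rough sketch of a couple of ways to check for spillover; the exact health code and counter names vary a bit by release, and osd.0 is just a placeholder:

    ceph health detail | grep -i spillover    # BLUEFS_SPILLOVER, if that warning is enabled
    # on the host running that OSD:
    ceph daemon osd.0 perf dump bluefs | grep -E 'slow_used_bytes|db_used_bytes'

A nonzero slow_used_bytes generally means BlueFS has spilled DB data onto the HDD.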

What type of SSD are you using here?  And how many HDD OSDs do you have using each?


> And we are talking about 18 separate devices, with separate
> cluster_network for the rebalancing etc.


18 separate devices?  Do you mean 18 OSDs per server?  18 servers?  Or the fact that you’re using 18TB HDDs?

> The index for the RGW is also on an HDD (for now).

Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, then yeah, any metadata ops are going to be dog slow.  Check that your OSDs actually do have external SSD DBs: over an OSD's lifecycle it's easy to deploy that way initially and then inadvertently rebuild OSDs without the external device.
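
One way to spot-check that (key names vary slightly across releases, and osd.0 is again just an example):

    ceph osd metadata 0 | grep -E 'bluefs_db|bluefs_dedicated_db|devices'
    # or sweep the whole cluster for OSDs lacking a dedicated DB device:
    for id in $(ceph osd ls); do
      ceph osd metadata $id | grep -q '"bluefs_dedicated_db": "1"' || echo "osd.$id: no dedicated DB"
    done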

> Now as far as the number of pgs is concerned, I reached that number,
> through one of the calculators that are found online.

You’re using the autoscaler, I see.  

In your `ceph osd df` output, look at the PGS column at right.  Your balancer seems to be working fairly well.  Your average number of PG replicas per OSD is around 71, which is in alignment with upstream guidance.  

But I would suggest going roughly twice as high; see the very recent thread on this list about PG counts.  I would adjust pg_num on each pool in accordance with its usage and needs so that the PGS column there ends up in the 150 - 200 range.
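
As a sketch of what that might look like for the bucket data pool (name taken from your autoscale output below; it already has the autoscaler off, so a manual bump should stick):

    ceph osd pool set default.rgw.buckets.data pg_num 8192
    # recent releases ramp pgp_num up gradually; watch the progress with:
    ceph osd pool get default.rgw.buckets.data pg_num
    ceph osd pool get default.rgw.buckets.data pgp_num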

> Since the cluster is doing Object store, Filesystem and Block storage, each pool has a different
> number for pg_num.
> In the RGW Data case, the pool has about 300TB in it , so perhaps that
> explains that the pg_num is lower than what you expected ?

Ah, mixed cluster.  You shoulda led with that ;)

default.rgw.buckets.data 356.7T 3.0 16440T 0.0651 1.0 4096 off False
default.rgw.buckets.index 5693M 3.0 16440T 0.0000 1.0 32 on False
default.rgw.buckets.non-ec 62769k 3.0 418.7T 0.0000 1.0 32 
volumes 8 16384 2.4 PiB 650.08M 7.2 PiB 53.80 2.1 PiB

You have three pools with appreciable data: the two RBD pools and your bucket data pool.  Your pg_num values more or less reflect that, which is in line with general guidance.

But the index pool is not sized by the data or objects stored.  The index pool holds mainly omap entries, not RADOS object data, and needs to be resourced differently.

Are all 978 OSDs on identical media?  Your `ceph df` output implies that you have some OSDs on SSDs, so I'll again ask for details on the media and how your OSDs are built.


Your index pool has only 32 PGs.  I suggest setting pg_num for that pool to, say, 1024.  It’ll take a while to split those PGs and you’ll see pgp_num slowly increasing, but when it’s done I strongly suspect that you’ll have better results.
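
Something along these lines; your autoscale output shows the index pool still has the autoscaler on, so you'd likely want to turn it off for that pool first so it doesn't undo the change:

    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
    ceph osd pool set default.rgw.buckets.index pg_num 1024
    # watch pgp_num catch up over time:
    ceph osd pool get default.rgw.buckets.index pgp_num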

AIUI the non-ec pool is mainly used for multipart upload metadata.  If your S3 objects are all around 4MB it probably doesn't matter.  If you do start using multipart uploads (MPU) you'll want to increase pg_num there too.
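
A quick way to see whether that pool is getting any real use is to watch its object count, e.g.:

    rados df | grep non-ec    # usage and object count for default.rgw.buckets.non-ec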


> 
> Regards,
> Harry
> 
> 
> 
> On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
> 
>> 
>> 
>>> Hello Ceph Community!
>>> 
>>> I have the following very interesting problem, for which I found no clear
>>> guidelines upstream so I am hoping to get some input from the mailing
>> list.
>>> I have a 6PB cluster in operation which is currently half full. The
>> cluster
>>> has around 1K OSD, and the RGW data pool  has 4096 pgs (and pgp_num).
>> 
>> Even without specifics I can tell you that pg_num is waaaaaaaaaaaaaay too
>> low.
>> 
>> Please send
>> 
>> `ceph -s`
>> `ceph osd tree | head -30`
>> `ceph osd df | head -10`
>> `ceph -v`
>> 
>> Also, tell us what media your index and bucket OSDs are on.
>> 
>>> The issue is as follows:
>>> Let's say that we have 10 million small objects (4MB) each.
>> 
>> In RGW terms, those are large objects.  Small objects would be 4KB.
>> 
>>> 1)Is there a performance difference *when fetching* between storing all
>> 10
>>> million objects in one bucket and storing 1 million in 10 buckets?
>> 
>> Larger buckets will generally be slower for some things, but if you’re on
>> Reef, and your bucket wasn’t created on an older release, 10 million
>> shouldn’t be too bad.  Listing larger buckets will always be increasingly
>> slower.
>> 
>>> There
>>> should be "some" because of the different number of pgs in use, in the 2
>>> scenarios but it is very hard to quantify.
>>> 
>>> 2) What if I have 100 million objects? Is there some theoretical limit /
>>> guideline on the number of objects that I should have in a bucket before
>> I
>>> see performance drops?
>> 
>> At that point, you might consider indexless buckets, if your
>> client/application can keep track of objects in its own DB.
>> 
>> With dynamic sharding (assuming you have it enabled), RGW defaults to
>> 100,000 objects per shard and 1999 max shards, so I *think* that after 199M
>> objects in a bucket it won’t auto-reshard.
>> 
>>> I should mention here that the contents of the bucket *never need to be
>>> listed, *The user always knows how to do a curl, to get the contents.
>> 
>> We can most likely improve your config, but you may also be a candidate
>> for an indexless bucket.  They don’t get a lot of press, and I won’t claim
>> to be expert in them, but it’s something to look into.
>> 
>> 
>>> 
>>> Thank you for your help,
>>> Harry
>>> 
>>> P.S.
>>> The following URLs have been very informative, but they do not answer my
>>> question unfortunately.
>>> 
>>> 
>> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
>>> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> 
>> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



