Re: Ceph RGW performance guidelines

> > Not surprising for HDDs.  Double your deep-scrub interval.
> 
> Done!
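(For the archive: that's typically just a config tweak, something like the below, where 1209600 seconds is two weeks, i.e. double the stock interval.)

  ceph config set osd osd_deep_scrub_interval 1209600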

If your PG ratio is low, say <200, bumping pg_num may help as well.  Oh yeah, looking up your gist from a prior message, you average around 70 PG replicas per OSD.  Aim for 200.

Your index pool has waaaaay too few PGs.  Set pg_num to 1024.  I’d jack up your buckets.data pool to at least 8192 as well.  If you do any MPU at all, I’d raise non-ec to 512 or 1024.
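For reference, those bumps are just pool settings, e.g. something along these lines, using the pool names from your autoscale output; you may want to set pg_autoscale_mode to off (or warn) on those pools first so the autoscaler doesn't fight you:

  ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
  ceph osd pool set default.rgw.buckets.index pg_num 1024
  ceph osd pool set default.rgw.buckets.data pg_num 8192
  ceph osd pool set default.rgw.buckets.non-ec pg_num 512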

> 
> > So you’re relying on the SSD DB device for the index pool?  Have you looked at your logs / metrics for those OSDs to see if there is any spillover?
> > What type of SSD are you using here?  And how many HDD OSDs do you have using each? 
> 
> I will try to describe the system as best I can. We are talking about 18 different hosts. Each host has a large number of HDDs and a small number of SSDs (4).
> Of these SSDs, 2 are used as the backend for a high-speed volume-ssd pool that certain VMs write into, and the other 2 are split into very large LVM partitions which act as the journal for the HDDs.

As I suspected.

> I have amended the gist to add that extra information from lsblk. I have not added any information regarding disk models etc., but off the top of my head, each HDD should be about 16T in size, and the NVMe is also extremely large and built for high-I/O systems.

There are NVMe devices available that decidedly are not suited for this purpose.  The usual rule of thumb I've seen for TLC-class NVMe WAL+DB devices is a maximum ratio of 10:1 spinners per device; you appear to be at roughly 21:1.

> Each db_device, as you can see in the lsblk output, is extremely large, so I think there is no spillover.

675GB is the largest WAL+DB partition I've ever seen.

> 
> > Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, then yeah any metadata ops are going to be dog slow.  Check that your OSDs actually do have external SSD DBs — it’s easy over the OSD lifecycle to deploy that way initially but to inadvertently rebuild OSDs without the external device. 
> 
> I will investigate

`ceph osd metadata`, piped through a suitable grep or jq, may show whether you have OSDs that aren’t actually using the offboard WAL+DB partition.
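Something like this, as a sketch; the exact metadata key can vary by release, so treat bluefs_dedicated_db as an assumption and eyeball the raw JSON first:

  # list OSD ids whose metadata does not report a dedicated BlueFS DB device
  ceph osd metadata | jq -r '.[] | select(.bluefs_dedicated_db != "1") | .id'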

> and I will start by planning a new pg bump for the volumes pool, which takes forever due to the size of the cluster

It takes forever because you have spinners ;). And because with recent Ceph releases the cluster throttles the (expensive) PG splitting to prevent DoS.  Splitting all the PGs at once can be … impactful.
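If you're curious how hard the mgr is pacing the splits, the knob (as I understand it) is target_max_misplaced_ratio; I wouldn't crank it on spinners, but it's handy to know it exists:

  ceph config get mgr target_max_misplaced_ratio    # typically 0.05, i.e. 5% misplaced at a time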

>  AND somehow move the index pool to an SSD device before bumping.

Is it only on dedicated NVMes right now?  Which would be what, 36 OSDs?  

With your WAL+DB SSDs having a 21:1 ratio, using them for the index pool instead / in addition may or may not improve your performance, but you could always move back.
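If you do try it, moving the index pool is mostly a CRUSH rule change, roughly like the below; this assumes your SSD OSDs report the 'ssd' device class, and the rule name here is made up:

  ceph osd crush rule create-replicated rgw-index-ssd default host ssd
  ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd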

> All this is excellent advice which I thank you for.
> 
> I would now like to ask your opinion on the original query:
> 
> Do you think that there is some palpable difference between 1 bucket with 10 million objects, and 10 buckets with 1 million objects each?

Depends on what you’re measuring.  The second case I suspect would list bucket contents faster.
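If you want to measure rather than guess, I'd watch index shard counts and listing latency for both layouts; something like the below, where BUCKETNAME is a placeholder, and IIRC recent releases report num_shards in the stats output:

  radosgw-admin bucket stats --bucket=BUCKETNAME    # object counts and (on recent releases) shard count
  radosgw-admin bucket limit check                  # flags buckets approaching reshard thresholds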

> Intuitively, I feel that the first case would mean interacting with far fewer pgs than the second (10 times fewer?), which spreads the load across more devices, but my knowledge of Ceph internals is nearly 0.
> 
> 
> Regards,
> Harry
> 
> 
> 
> On Tue, Oct 15, 2024 at 4:26 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>> 
>> 
>> > On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkominos@xxxxxxxxx> wrote:
>> > 
>> > Hello Anthony and thank you for your response!
>> > 
>> > I have placed the requested info in a separate gist here:
>> > https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885
>> 
>> > 3826 pgs not deep-scrubbed in time
>> > 1501 pgs not scrubbed in time
>> 
>> Not surprising for HDDs.  Double your deep-scrub interval.
>> 
>> > Every OSD is an HDD, with their corresponding index, on a partition in an
>> > SSD device.
>> 
>> 
>> So you’re relying on the SSD DB device for the index pool?  Have you looked at your logs / metrics for those OSDs to see if there is any spillover?
>> 
>> What type of SSD are you using here?  And how many HDD OSDs do you have using each?
>> 
>> 
>> > And we are talking about 18 separate devices, with separate
>> > cluster_network for the rebalancing etc.
>> 
>> 
>> 18 separate devices?  Do you mean 18 OSDs per server?  18 servers?  Or the fact that you’re using 18TB HDDs?
>> 
>> > The index for the RGW is also on an HDD (for now).
>> 
>> Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, then yeah any metadata ops are going to be dog slow.  Check that your OSDs actually do have external SSD DBs — it’s easy over the OSD lifecycle to deploy that way initially but to inadvertently rebuild OSDs without the external device.  
>> 
>> > Now as far as the number of pgs is concerned, I reached that number,
>> > through one of the calculators that are found online.
>> 
>> You’re using the autoscaler, I see.  
>> 
>> In your `ceph osd df` output, look at the PGS column at right.  Your balancer seems to be working fairly well.  Your average number of PG replicas per OSD is around 71, which is in alignment with upstream guidance.  
>> 
>> But I would suggest going twice as high.  See the very recent thread about PGs.  So I would adjust pg_num on pools in accordance with their usage and needs so that the PGS column there ends up in the 150 - 200 range.
>> 
>> > Since the cluster is doing Object store, Filesystem and Block storage, each pool has a different
>> > number for pg_num.
>> > In the RGW Data case, the pool has about 300TB in it , so perhaps that
>> > explains that the pg_num is lower than what you expected ?
>> 
>> Ah, mixed cluster.  You shoulda led with that ;)
>> 
>> default.rgw.buckets.data 356.7T 3.0 16440T 0.0651 1.0 4096 off False
>> default.rgw.buckets.index 5693M 3.0 16440T 0.0000 1.0 32 on False
>> default.rgw.buckets.non-ec 62769k 3.0 418.7T 0.0000 1.0 32 
>> volumes 8 16384 2.4 PiB 650.08M 7.2 PiB 53.80 2.1 PiB
>> 
>> You have three pools with appreciable data — the two RBD pools and your bucket pool.  Your pg_nums are more or less reflective of that, which is general guidance.
>> 
>> But the index pool is not about data or objects stored.  The index pool is mainly omaps not RADOS objects, and needs to be resourced differently.
>> Assuming that all 978 OSDs are identical media?  Your `ceph df` output though implies that you have OSDs on SSDs, so I’ll again request info on the media and how your OSDs are built.
>> 
>> 
>> Your index pool has only 32 PGs.  I suggest setting pg_num for that pool to, say, 1024.  It’ll take a while to split those PGs and you’ll see pgp_num slowly increasing, but when it’s done I strongly suspect that you’ll have better results.
>> 
>> The non-ec pool is mainly AIUI used for multipart uploads.  If your S3 objects are 4MB in size it probably doesn’t matter.  If you do start using MPU you’ll want to increase pg_num there too.
>> 
>> 
>> > 
>> > Regards,
>> > Harry
>> > 
>> > 
>> > 
>> > On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>> > 
>> >> 
>> >> 
>> >>> Hello Ceph Community!
>> >>> 
>> >>> I have the following very interesting problem, for which I found no clear
>> >>> guidelines upstream so I am hoping to get some input from the mailing
>> >> list.
>> >>> I have a 6PB cluster in operation which is currently half full. The
>> >> cluster
>> >>> has around 1K OSD, and the RGW data pool  has 4096 pgs (and pgp_num).
>> >> 
>> >> Even without specifics I can tell you that pg_num is waaaaaaaaaaaaaay too
>> >> low.
>> >> 
>> >> Please send
>> >> 
>> >> `ceph -s`
>> >> `ceph osd tree | head -30`
>> >> `ceph osd df | head -10`
>> >> `ceph -v`
>> >> 
>> >> Also, tell us what media your index and bucket OSDs are on.
>> >> 
>> >>> The issue is as follows:
>> >>> Let's say that we have 10 million small objects (4MB) each.
>> >> 
>> >> In RGW terms, those are large objects.  Small objects would be 4KB.
>> >> 
>> >>> 1)Is there a performance difference *when fetching* between storing all
>> >> 10
>> >>> million objects in one bucket and storing 1 million in 10 buckets?
>> >> 
>> >> Larger buckets will generally be slower for some things, but if you’re on
>> >> Reef, and your bucket wasn’t created on an older release, 10 million
>> >> shouldn’t be too bad.  Listing larger buckets will always be increasingly
>> >> slower.
>> >> 
>> >>> There
>> >>> should be "some" because of the different number of pgs in use, in the 2
>> >>> scenarios but it is very hard to quantify.
>> >>> 
>> >>> 2) What if I have 100 million objects? Is there some theoretical limit /
>> >>> guideline on the number of objects that I should have in a bucket before
>> >> I
>> >>> see performance drops?
>> >> 
>> >> At that point, you might consider indexless buckets, if your
>> >> client/application can keep track of objects in its own DB.
>> >> 
>> >> With dynamic sharding (assuming you have it enabled), RGW defaults to
>> >> 100,000 objects per shard and 1999 max shards, so I *think* that after 199M
>> >> objects in a bucket it won’t auto-reshard.
>> >> 
>> >>> I should mention here that the contents of the bucket *never need to be
>> >>> listed, *The user always knows how to do a curl, to get the contents.
>> >> 
>> >> We can most likely improve your config, but you may also be a candidate
>> >> for an indexless bucket.  They don’t get a lot of press, and I won’t claim
>> >> to be expert in them, but it’s something to look into.
>> >> 
>> >> 
>> >>> 
>> >>> Thank you for your help,
>> >>> Harry
>> >>> 
>> >>> P.S.
>> >>> The following URLs have been very informative, but they do not answer my
>> >>> question unfortunately.
>> >>> 
>> >>> 
>> >> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
>> >>> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
>> >>> _______________________________________________
>> >>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >> 
>> >> 
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



