> Not surprising for HDDs. Double your deep-scrub interval.

Done!

> So you’re relying on the SSD DB device for the index pool? Have you
> looked at your logs / metrics for those OSDs to see if there is any
> spillover?
>
> What type of SSD are you using here? And how many HDD OSDs do you have
> using each?

I will try to describe the system as best I can. We are talking about 18
different hosts. Each host has a large number of HDDs and a small number of
SSDs (4). Of these SSDs, 2 are used as the backend for a high-speed
volume-ssd pool that certain VMs write into, and the other 2 are split into
very large LVM partitions which act as the journal (DB device) for the HDDs.
So in config terms the system looks like this:

"data_devices=/dev/ceph-hdd01/osd-hdd01,db_devices=/dev/ceph-nvme0n1/ceph-nvme0n1-lv01"

I have amended the gist to add that extra information from lsblk. I have not
added any information regarding disk models etc., but from the top of my head
each HDD should be about 16T in size, and the NVMe is also extremely large
and built for high-I/O systems. Each db_device, as you can see in the lsblk
output, is extremely large, so I think there is no spillover.
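To verify that rather than just assume it, I plan to run something like the
following on a couple of OSDs per host (osd.0 is only an example id, and the
exact metadata field names can differ slightly between releases):

    # any spillover shows up as a BLUEFS_SPILLOVER health warning
    ceph health detail | grep -i BLUEFS_SPILLOVER

    # confirm the OSD really was built with a dedicated DB device
    ceph osd metadata 0 | grep -E 'bluefs_db|bluefs_dedicated_db|devices'

    # on the OSD's host: DB bytes used vs. bytes spilled to the slow (HDD) device
    ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'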
> Uggh. If the index pool is entirely on HDDs, with no SSD DB partition,
> then yeah any metadata ops are going to be dog slow. Check that your OSDs
> actually do have external SSD DBs — it’s easy over the OSD lifecycle to
> deploy that way initially but to inadvertently rebuild OSDs without the
> external device.

I will investigate. I will start by planning a new pg bump for the volumes
pool (which takes forever due to the size of the cluster), AND somehow move
the index pool to an SSD device before bumping.
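Roughly what I have in mind for the index pool, if I am reading the
documentation correctly (the rule name is mine, and this assumes the
SSD-backed OSDs carry the "ssd" device class; changing the rule will of
course trigger backfill):

    # replicated rule limited to OSDs of device class "ssd", failure domain host
    ceph osd crush rule create-replicated rgw-index-ssd default host ssd

    # point the index pool at the new rule, then give it more PGs as suggested
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd
    ceph osd pool set default.rgw.buckets.index pg_num 1024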
All this is excellent advice, for which I thank you. I would now like to ask
your opinion on the original query: do you think that there is some palpable
difference between 1 bucket with 10 million objects and 10 buckets with 1
million objects each? Intuitively, I feel that the first case would mean
interacting with far fewer pgs than the second (10 times fewer?), which
spreads the load on more devices, but my knowledge of ceph internals is
nearly 0.

Regards,
Harry

On Tue, Oct 15, 2024 at 4:26 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> > On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkominos@xxxxxxxxx> wrote:
> >
> > Hello Anthony and thank you for your response!
> >
> > I have placed the requested info in a separate gist here:
> > https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885
>
> > 3826 pgs not deep-scrubbed in time
> > 1501 pgs not scrubbed in time
>
> Not surprising for HDDs. Double your deep-scrub interval.
>
> > Every OSD is an HDD, with their corresponding index, on a partition in
> > an SSD device.
>
> So you’re relying on the SSD DB device for the index pool? Have you
> looked at your logs / metrics for those OSDs to see if there is any
> spillover?
>
> What type of SSD are you using here? And how many HDD OSDs do you have
> using each?
>
> > And we are talking about 18 separate devices, with separate
> > cluster_network for the rebalancing etc.
>
> 18 separate devices? Do you mean 18 OSDs per server? 18 servers? Or the
> fact that you’re using 18TB HDDs?
>
> > The index for the RGW is also on an HDD (for now).
>
> Uggh. If the index pool is entirely on HDDs, with no SSD DB partition,
> then yeah any metadata ops are going to be dog slow. Check that your OSDs
> actually do have external SSD DBs — it’s easy over the OSD lifecycle to
> deploy that way initially but to inadvertently rebuild OSDs without the
> external device.
>
> > Now as far as the number of pgs is concerned, I reached that number
> > through one of the calculators that are found online.
>
> You’re using the autoscaler, I see.
>
> In your `ceph osd df` output, look at the PGS column at right. Your
> balancer seems to be working fairly well. Your average number of PG
> replicas per OSD is around 71, which is in alignment with upstream
> guidance.
>
> But I would suggest going twice as high. See the very recent thread about
> PGs. So I would adjust pg_num on pools in accordance with their usage and
> needs so that the PGS column there ends up in the 150 - 200 range.
>
> > Since the cluster is doing Object store, Filesystem and Block storage,
> > each pool has a different number for pg_num.
> > In the RGW Data case, the pool has about 300TB in it, so perhaps that
> > explains that the pg_num is lower than what you expected?
>
> Ah, mixed cluster. You shoulda led with that ;)
>
> default.rgw.buckets.data    356.7T  3.0  16440T  0.0651  1.0  4096  off  False
> default.rgw.buckets.index   5693M   3.0  16440T  0.0000  1.0    32  on   False
> default.rgw.buckets.non-ec  62769k  3.0  418.7T  0.0000  1.0    32
> volumes   8   16384   2.4 PiB   650.08M   7.2 PiB   53.80   2.1 PiB
>
> You have three pools with appreciable data — the two RBD pools and your
> bucket pool. Your pg_nums are more or less reflective of that, which is
> general guidance.
>
> But the index pool is not about data or objects stored. The index pool is
> mainly omaps, not RADOS objects, and needs to be resourced differently.
> Assuming that all 978 OSDs are identical media? Your `ceph df` output
> though implies that you have OSDs on SSDs, so I’ll again request info on
> the media and how your OSDs are built.
>
> Your index pool has only 32 PGs. I suggest setting pg_num for that pool
> to, say, 1024. It’ll take a while to split those PGs and you’ll see
> pgp_num slowly increasing, but when it’s done I strongly suspect that
> you’ll have better results.
>
> The non-ec pool is mainly AIUI used for multipart uploads. If your S3
> objects are 4MB in size it probably doesn’t matter. If you do start using
> MPU you’ll want to increase pg_num there too.
>
> > Regards,
> > Harry
> >
> > On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > wrote:
> >
> >>> Hello Ceph Community!
> >>>
> >>> I have the following very interesting problem, for which I found no
> >>> clear guidelines upstream, so I am hoping to get some input from the
> >>> mailing list.
> >>> I have a 6PB cluster in operation which is currently half full. The
> >>> cluster has around 1K OSDs, and the RGW data pool has 4096 pgs (and
> >>> pgp_num).
> >>
> >> Even without specifics I can tell you that pg_num is waaaaaaaaaaaaaay
> >> too low.
> >>
> >> Please send
> >>
> >> `ceph -s`
> >> `ceph osd tree | head -30`
> >> `ceph osd df | head -10`
> >> `ceph -v`
> >>
> >> Also, tell us what media your index and bucket OSDs are on.
> >>
> >>> The issue is as follows:
> >>> Let's say that we have 10 million small objects (4MB each).
> >>
> >> In RGW terms, those are large objects. Small objects would be 4KB.
> >>
> >>> 1) Is there a performance difference *when fetching* between storing
> >>> all 10 million objects in one bucket and storing 1 million in 10
> >>> buckets?
> >>
> >> Larger buckets will generally be slower for some things, but if you’re
> >> on Reef, and your bucket wasn’t created on an older release, 10 million
> >> shouldn’t be too bad. Listing larger buckets will always be
> >> increasingly slower.
> >>
> >>> There should be "some" because of the different number of pgs in use
> >>> in the 2 scenarios, but it is very hard to quantify.
> >>>
> >>> 2) What if I have 100 million objects? Is there some theoretical limit /
> >>> guideline on the number of objects that I should have in a bucket
> >>> before I see performance drops?
> >>
> >> At that point, you might consider indexless buckets, if your
> >> client/application can keep track of objects in its own DB.
> >>
> >> With dynamic sharding (assuming you have it enabled), RGW defaults to
> >> 100,000 objects per shard and 1999 max shards, so I *think* that after
> >> 199M objects in a bucket it won’t auto-reshard.
> >>
> >>> I should mention here that the contents of the bucket *never need to
> >>> be listed*. The user always knows how to do a curl to get the contents.
> >>
> >> We can most likely improve your config, but you may also be a candidate
> >> for an indexless bucket. They don’t get a lot of press, and I won’t
> >> claim to be expert in them, but it’s something to look into.
> >>
> >>> Thank you for your help,
> >>> Harry
> >>>
> >>> P.S.
> >>> The following URLs have been very informative, but they do not answer
> >>> my question unfortunately.
> >>>
> >>> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
> >>> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
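For reference, the dynamic resharding defaults Anthony mentions above, and
the current shard usage per bucket, can be checked with something like the
following ("mybucket" is only a placeholder, and the config section to query
depends on how your RGW daemons are named):

    # per-bucket object counts, shard counts and fill status
    radosgw-admin bucket limit check

    # details for one specific bucket (recent releases include num_shards here)
    radosgw-admin bucket stats --bucket=mybucket

    # the thresholds that dynamic resharding works against
    ceph config get client.rgw rgw_dynamic_resharding
    ceph config get client.rgw rgw_max_objs_per_shard
    ceph config get client.rgw rgw_max_dynamic_shards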