I'm going to assume that ALL of your pools are replicated with size 3, since you didn't provide that info, and that all but the *hdd pools are on SSDs -- `ceph osd dump | grep pool` will show the sizes and CRUSH rules. Let me know if that isn't the case.

With that assumption, I make your overall PG ratio to be ~57 (2057 PGs x 3 replicas / 107 OSDs), which is way too low. Run `ceph osd df` and look at the next-to-last column; that's the number of PG shards/replicas on each OSD. In my example below there are 145. Check both your HDD and SSD OSDs.

[rook@rook-ceph-tools-5ff8d58445-gkl5w ~]$ ceph osd df | tail
433  hdd  7.27739  1.00000  7.3 TiB  344 MiB  250 MiB  0 B  94 MiB  7.3 TiB  0.00  0.04  145  up
462  hdd  7.27739  1.00000  7.3 TiB  344 MiB  250 MiB  0 B  95 MiB  7.3 TiB  0.00  0.04  145  up

Current upstream guidance is to target ~100 PG shards/replicas on each OSD, to protect against running out of RAM. Some people, including me, find larger numbers appropriate depending on the media -- ymmv, free advice and worth every penny. I personally shoot for ~150-200 per HDD OSD and 200-300 per SSD OSD. I would suggest the below.

You don't mention RAM, though. Assuming that your OSDs are all BlueStore, I would suggest at least 64GB on the HDD nodes and 96GB on the SSD nodes *for the OSD processes*. If you have VMs or mons/mgrs or other significant compute on the OSD nodes, they'll naturally need extra. When you increase pg_num for a given pool, you'll increase the RAM that the OSDs hosting that pool use, especially during topology changes or startup. So if your RAM is marginal now, increasing pg_num could lead to OOM-killing. Trust me, that's something best avoided. If you still have FileStore OSDs, I strongly suggest redeploying them one at a time as BlueStore.

> Additionally, some OSDs fail during the scrubbing process. In such
> instances, promptly halting the scrubbing resolves the issue.

Have you looked for drive errors? e.g. with `dmesg`.

> I intend to enlarge the PG size for the "one-ssd" configuration.

Not a bad idea, see below.

> Please provide the PG number, and suggest the optimal approach to increase the PG
> size without causing any slowdown or service disruptions to the VMs.

With releases beginning with Nautilus, this became a MUCH easier task than it used to be.

> > nodeep-scrub flag(s) set
> > 656 pgs not deep-scrubbed in time

If you have osd_scrub_begin_hour / osd_scrub_end_hour (or begin/end day) set, that could contribute to this. IMHO one should be able to scrub around the clock, unless your workload varies considerably throughout the day. Assuming that you aren't setting those, you might want to double osd_deep_scrub_interval. The default is 7 days; I might suggest 14 for your HDD OSDs, assuming those are the ones that are not being deep-scrubbed in time.

> > osd: 107 osds: 105 up (since 3h), 105 in (since 3d)

You have two OSDs down, btw.

> data:
>   pools:   13 pools, 2057 pgs
>   objects: 9.40M objects, 35 TiB
>   usage:   106 TiB used, 154 TiB / 259 TiB avail
>   pgs:     2057 active+clean
>
> root@ceph1:~# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    151 TiB   78 TiB   72 TiB    73 TiB      48.04
> ssd    110 TiB   78 TiB   32 TiB    32 TiB      29.42
> TOTAL  261 TiB  156 TiB  104 TiB   105 TiB      40.19

One often aims for each pool's share of the total PG count to be roughly proportional to its share of the cluster's data.

https://old.ceph.com/pgcalc/
^ this isn't working for me at the moment

Right now your RGW buckets.data pool is using half the space of the one-ssd pool, but it has twice the PGs.
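Since I keep referring to per-class PG ratios, here's a quick way to compute them yourself instead of eyeballing `ceph osd df` -- just a sketch, and it assumes the stock column layout (CLASS is the second field, PGS is next to last):

# average PG shards/replicas per OSD, split by device class
ceph osd df | awk '
    $2 == "hdd" { h += $(NF-1); nh++ }
    $2 == "ssd" { s += $(NF-1); ns++ }
    END {
        if (nh) printf "hdd avg PG replicas/OSD: %.0f\n", h/nh
        if (ns) printf "ssd avg PG replicas/OSD: %.0f\n", s/ns
    }'

On your numbers that should land near the ~74 (HDD) and ~48 (SSD) I work out below; if it doesn't, my guess about which pools live on which media is wrong.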
So you'd want to invoke:

ceph osd pool set one-ssd pg_num 512

If you want to gauge the impact, you might set it to, say, 260 first and see how it goes, but most likely you could set it to 512 and the cluster will gradually step it up a few PGs at a time. During that time you'll see some data moving around. It is ideal for each pool to end up with pg_num set to a power of 2, but you can have in-between numbers along the way for a short time.

If my assumptions and calculations above are correct, your HDD OSDs have a PG ratio of ~74. If you were to bump each of the HDD pools from 512 to 1024 PGs, your ratio would end up roughly at 150, which IMHO is desirable -- assuming you have the RAM. But address your SSD pools first, they're suffering more.

If my assumptions and calculations above are correct, your SSD OSDs have a PG ratio of ~48, which is waaay too low. I suggest setting one-ssd to 512, which should result in the ratio on those OSDs rising to ~64.

If that holds true given my interpretation of what you supplied, I might then set the index and log pools to 64 -- you want those to be >= the number of OSDs they're on, which is not currently the case. Then choose whichever of buckets.data and one-ssd is used more heavily and set it to 1024. Let the cluster settle, check the PG ratios again, and I think you'll be around ~100 on the SSD OSDs, at which point I might then set the other SSD pool to 1024. Do these one at a time, in however many increments you're comfortable with.

> --- POOLS ---
> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> cephfs_data                  1   64  3.8 KiB        0   11 KiB      0     23 TiB
> cephfs_metadata              2    8  228 MiB       79  685 MiB      0     23 TiB
> .rgw.root                    3   32  6.0 KiB        8  1.5 MiB      0     23 TiB
> default.rgw.control          4   32      0 B        8      0 B      0     23 TiB
> default.rgw.meta             5   32   12 KiB       48  7.5 MiB      0     23 TiB
> default.rgw.log              6   32  4.8 KiB      207  6.0 MiB      0     23 TiB
> default.rgw.buckets.index    7   32  410 MiB       15  1.2 GiB      0     23 TiB
> default.rgw.buckets.data     8  512  4.6 TiB    1.29M   14 TiB  16.59     23 TiB
> default.rgw.buckets.non-ec   9   32  1.0 MiB      676  130 MiB      0     23 TiB
> one-hdd                     10  512  9.2 TiB    2.45M   28 TiB  28.69     23 TiB
> device_health_metrics       11    1  9.5 MiB      113   28 MiB      0     23 TiB
> one-ssd                     12  256   11 TiB    2.88M   32 TiB  31.37     23 TiB
> cloudstack.hdd              15  512   10 TiB    2.72M   31 TiB  30.94     23 TiB
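In case it's useful, here's roughly the shell sequence for that first bump -- a sketch only, with the pool name taken from your `ceph df` output above; the throttle and scrub-interval lines are assumptions about what you might want, not things you must set:

# sanity check the pool first
ceph osd pool get one-ssd pg_num
ceph osd pool get one-ssd size

# optional: limit the impact of the resulting data movement
ceph config set osd osd_max_backfills 1              # fewer concurrent backfills per OSD
ceph config get mgr target_max_misplaced_ratio       # .05 by default; governs how fast pg_num steps up

# the actual bump -- since Nautilus the cluster raises pg_num a few PGs at a time
ceph osd pool set one-ssd pg_num 512

# watch it settle
ceph -s
ceph osd df                                          # the PGS column should creep upward

# separately, doubling the deep-scrub interval for just the HDD OSDs would look
# something like this (1209600 seconds = 14 days; the class:hdd mask assumes you
# want it per device class rather than global)
ceph config set osd/class:hdd osd_deep_scrub_interval 1209600

Once `ceph -s` shows everything back to active+clean, repeat the pg_num step for the next pool on the list.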