Hi,

we've encountered the same issue after upgrading to Octopus on one of our RBD clusters, and now it has reappeared after the autoscaler lowered the PGs from 8k to 2k for the RBD pool.

What we've done in the past:

- recreated all OSDs after our 2nd incident with slow ops within a single week after the Ceph upgrade (early September)
- upgraded the OS from CentOS 7 to Ubuntu Focal after the third incident (December)
- offline-compacted all OSDs a week ago, because we had some (~500) very old snapshots lying around and hoped that the snaptrim would run faster on a freshly compacted RocksDB

After the first incident we had roughly three months of smooth sailing, and now, again roughly three months later, we are seeing these slow ops once more. This time it might be because we have some OSDs (2 TB SSD) with very few PGs (~30) and some OSDs (8 TB SSD) with a lot of PGs (~120).

I will try to compact all OSDs and check whether it stops again, but I think I need to bump the pool back up to 4k PGs, because the problem started again when the autoscaler lowered the PGs. And from our data (Prometheus), the apply latency goes up to 9 seconds and it mostly hits the 8 TB disks.

I am currently running "time ceph daemon osd.1 calc_objectstore_db_histogram" for all OSDs (roughly as in the sketch below) and get very mixed values, but none of them lies in the <1 minute range.
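In case it is useful to anyone, this is roughly the loop I run on each host. It is only a sketch: it assumes a non-containerized deployment with the default admin socket location (/var/run/ceph/ceph-osd.<id>.asok), so adjust the path if you run cephadm/containers.

    #!/usr/bin/env bash
    # Time calc_objectstore_db_histogram for every OSD on this host and
    # print its num_pgmeta_omap value.
    # Assumption: default admin socket path of a non-containerized
    # deployment; adjust for cephadm/container setups.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=$(basename "$sock" .asok)
        id=${id#ceph-osd.}
        echo "== osd.${id} =="
        time ceph daemon "osd.${id}" calc_objectstore_db_histogram \
            | grep num_pgmeta_omap
    done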
On Wed, Feb 15, 2023 at 16:42, Victor Rodriguez <vrodriguez@xxxxxxxxxxxxx> wrote:

> An update on this for the record:
>
> To fully solve this I've had to destroy each OSD and create them again,
> one by one. I could have done it one host at a time, but I preferred to
> be on the safe side just in case something else went wrong.
>
> The values for num_pgmeta_omap (which I don't know the meaning of, yet)
> on the new OSDs were similar to those in other clusters (I've seen from
> roughly 300000 to 700000), so I believe the characteristics of the data
> in the cluster do not determine how big or small num_pgmeta_omap should be.
>
> One thing I've noticed is that /bad/ or /damaged/ OSDs (i.e. those
> showing high CPU usage and poor performance during the trim operation)
> took much more time to calculate their histogram, even if their
> num_pgmeta_omap was low:
>
> (/bad OSD/):
> # time ceph daemon osd.1 calc_objectstore_db_histogram | grep "num_pgmeta_omap"
>     "num_pgmeta_omap": 673208,
>
> real    1m14,549s
> user    0m0,075s
> sys     0m0,025s
>
> (/good new OSD/):
> # time ceph daemon osd.1 calc_objectstore_db_histogram | grep "num_pgmeta_omap"
>     "num_pgmeta_omap": 434298,
>
> real    0m18,022s
> user    0m0,078s
> sys     0m0,023s
>
> Maybe it is worth checking that histogram from time to time as a way to
> measure OSD "health"?
>
> Again, thanks everyone.
>
>
> On 1/30/23 18:18, Victor Rodriguez wrote:
> >
> > On 1/30/23 15:15, Ana Aviles wrote:
> >> Hi,
> >>
> >> Josh already suggested it, but I will one more time: we had similar
> >> behaviour upgrading from Nautilus to Pacific. In our case compacting
> >> the OSDs did the trick.
> >
> > Thanks for chiming in! Unfortunately, in my case neither an online
> > compaction (ceph tell osd.ID compact) nor an offline repair
> > (ceph-bluestore-tool repair --path /var/lib/ceph/osd/OSD_ID) helps.
> > Compactions do seem to compact some amount; the OSD log dumps
> > information about the size of RocksDB. It went from this:
> >
> > Level  Files    Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >   L0    0/0     0.00 KB    0.0       0.0     0.0       0.0        3.9       3.9        0.0    1.0       0.0      62.2      64.81       61.59            89       0.728        0        0
> >   L1    3/0   132.84 MB    0.5       7.0     3.9       3.1        5.0       2.0        0.0    1.3      63.8      46.1     112.11      108.52            23       4.874      56M    7276K
> >   L2   12/0   690.99 MB    0.8       6.5     1.8       4.7        5.6       0.9        0.1    3.2      21.4      18.5     310.78      307.14            28      11.099     165M    3077K
> >   L3   54/0     3.37 GB    0.1       0.9     0.3       0.6        0.5      -0.1        0.0    1.6      35.9      20.2      24.84       24.49             4       6.210      24M      15M
> >  Sum   69/0     4.17 GB    0.0      14.4     6.0       8.3       15.1       6.7        0.1    3.8      28.7      30.1     512.54      501.74           144       3.559     246M      26M
> >  Int    0/0     0.00 KB    0.0       0.8     0.3       0.5        0.6       0.1        0.0   14.1      27.5      20.7      31.13       30.73             4       7.783      18M    4086K
> >
> > To this:
> >
> > Level  Files    Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >   L0    2/0    72.42 MB    0.5       0.0     0.0       0.0        0.1       0.1        0.0    1.0       0.0      63.2       1.14        0.84             2       0.572        0        0
> >   L3   48/0     3.10 GB    0.1       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.0       0.00        0.00             0       0.000        0        0
> >  Sum   50/0     3.17 GB    0.0       0.0     0.0       0.0        0.1       0.1        0.0    1.0       0.0      63.2       1.14        0.84             2       0.572        0        0
> >  Int    0/0     0.00 KB    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.0       0.00        0.00             0       0.000        0        0
> >
> > Still, it feels "too big" compared to some other OSDs in other
> > similarly sized clusters, making me think that there is some kind of
> > "garbage" making the trim go crazy.
> >
> >> For us there was no performance impact running the compaction (ceph
> >> daemon osd.0 compact), although we ran them in batches and not all at
> >> once on all OSDs, just in case. Also, no need to restart OSDs for
> >> this operation.
> >
> > Yes, compacting had no perceived impact on client performance, just
> > some higher CPU usage for the OSD process.
> >
> > Does anyone know by any chance the meaning of "num_pgmeta_omap" in the
> > "ceph daemon osd.ID calc_objectstore_db_histogram" output? As I
> > mentioned, the OSDs in this cluster have very different values in that
> > field, while all other clusters have much more similar values:
> >
> > osd.0:   "num_pgmeta_omap": 17526766,
> > osd.1:   "num_pgmeta_omap": 2653379,
> > osd.2:   "num_pgmeta_omap": 12358703,
> > osd.3:   "num_pgmeta_omap": 6404975,
> > osd.6:   "num_pgmeta_omap": 19845318,
> > osd.7:   "num_pgmeta_omap": 6043083,
> > osd.12:  "num_pgmeta_omap": 18666776,
> > osd.13:  "num_pgmeta_omap": 615846,
> > osd.14:  "num_pgmeta_omap": 13190188,
> >
> > Thanks a lot!
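PS: For anyone who wants to copy it, this is roughly how I plan to run the online compactions in batches, along the lines Ana described above. It is only a sketch; the OSD ids and the batch size are placeholders and need to be adapted to your cluster.

    #!/usr/bin/env bash
    # Online-compact a list of OSDs a few at a time, waiting between
    # batches so the whole cluster is never compacting at once.
    # Assumption: OSD ids and batch size below are placeholders only.
    OSDS=(0 1 2 3 6 7 12 13 14)
    BATCH=3

    for ((i = 0; i < ${#OSDS[@]}; i += BATCH)); do
        for id in "${OSDS[@]:i:BATCH}"; do
            echo "compacting osd.${id}"
            ceph tell "osd.${id}" compact &
        done
        wait    # let the current batch finish before starting the next
    done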