Hi Victor,

Out of curiosity, did you upgrade the cluster recently to octopus? We and
others observed this behaviour when following one of the two routes to
upgrade OSDs. There was a thread "Octopus OSDs extremely slow during
upgrade from mimic", which seems to have been lost with the recent mail
list outage. If it is relevant, I could copy the pieces I have into here.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Victor Rodriguez <vrodriguez@xxxxxxxxxxxxx>
Sent: 29 January 2023 22:40:46
To: ceph-users@xxxxxxx
Subject: Re: Very slow snaptrim operations blocking client I/O

Looks like this is going to take a few days. I hope to manage the
available performance for VMs with osd_snap_trim_sleep_ssd.

I'm wondering whether, after that long snaptrim process you went through,
your cluster was stable again and snapshots/snaptrims worked properly?

On 1/29/23 16:01, Matt Vandermeulen wrote:
> I should have explicitly stated that during the recovery it was still
> quite bumpy for customers. Some snaptrims were very quick, some took
> what felt like a really long time. This was, however, a cluster with a
> very large number of volumes and a long, long history of snapshots.
> I'm not sure what the difference will be between our case and a single
> large volume with a big snapshot.
>
> On 2023-01-28 20:45, Victor Rodriguez wrote:
>> On 1/29/23 00:50, Matt Vandermeulen wrote:
>>> I've observed a similar horror when upgrading a cluster from
>>> Luminous to Nautilus, which had the same effect of an overwhelming
>>> amount of snaptrim making the cluster unusable.
>>>
>>> In our case, we held its hand by setting all OSDs to have zero max
>>> trimming PGs, unsetting nosnaptrim, and then slowly enabling
>>> snaptrim a few OSDs at a time. It was painful to babysit, but it
>>> allowed the cluster to catch up without falling over.
>>
>> That's an interesting approach! Thanks!
>>
>> Preliminary tests suggest that just running snaptrim on a single PG
>> of a single OSD still makes the cluster barely usable. I have to
>> increase osd_snap_trim_sleep_ssd to ~1 so the cluster remains usable
>> at about a third of its performance. After a while a few PGs got
>> trimmed, and it feels like some of them are harder to trim than
>> others, as some need a higher osd_snap_trim_sleep_ssd value to let
>> the cluster perform.
>>
>> I don't know how long this is going to take... Maybe recreating the
>> OSDs and dealing with the rebalance is a better option?
>>
>> There's something ugly going on here... I would really like to put my
>> finger on it.
>>
>>> On 2023-01-28 19:43, Victor Rodriguez wrote:
>>>> After some investigation this is what I'm seeing:
>>>>
>>>> - OSD processes get stuck at 100% CPU or more if I ceph osd unset
>>>> nosnaptrim. They stay at 100% CPU even if I ceph osd set
>>>> nosnaptrim, and have stayed like that for at least 26 hours. Some
>>>> quick benchmarks don't show a reduction in cluster performance.
>>>>
>>>> - Restarting an OSD lowers its CPU usage to typical levels, as
>>>> expected, but it also usually brings some other OSD on a different
>>>> host back to typical levels.
>>>>
>>>> - All OSDs in this cluster take quite a while to start: between 35
>>>> and 70 seconds depending on the OSD. Clearly much longer than any
>>>> other OSD in any of my clusters.
>>>>
>>>> - I believe the size of the rocksdb database is dumped in the OSD
>>>> log when an automatic compaction is triggered.
>>>> The "sum" sizes of these OSDs range between 2.5 and 5.1 GB. That's
>>>> way bigger than those in any other cluster I have.
>>>>
>>>> - ceph daemon osd.* calc_objectstore_db_histogram is giving values
>>>> for num_pgmeta_omap (I don't know what it is) that are way bigger
>>>> than those on any other of my clusters for some OSDs. Also, the
>>>> values are not similar among the OSDs which hold the same PGs.
>>>>
>>>> osd.0:  "num_pgmeta_omap": 17526766,
>>>> osd.1:  "num_pgmeta_omap": 2653379,
>>>> osd.2:  "num_pgmeta_omap": 12358703,
>>>> osd.3:  "num_pgmeta_omap": 6404975,
>>>> osd.6:  "num_pgmeta_omap": 19845318,
>>>> osd.7:  "num_pgmeta_omap": 6043083,
>>>> osd.12: "num_pgmeta_omap": 18666776,
>>>> osd.13: "num_pgmeta_omap": 615846,
>>>> osd.14: "num_pgmeta_omap": 13190188,
>>>>
>>>> - Compacting the OSDs barely reduces the rocksdb size and does not
>>>> reduce num_pgmeta_omap at all.
>>>>
>>>> - This is the only cluster I have where there are some RBD images
>>>> that I mount directly from some clients, that is, they are not
>>>> disks for QEMU/Proxmox VMs. Maybe I have something misconfigured
>>>> related to this? This cluster is at least two and a half years old
>>>> and never had this issue with snaptrims.
>>>>
>>>> Thanks in advance!
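
A rough sketch of the throttled approach Matt describes above, assuming
the "max trimming PGs" knob is osd_pg_max_concurrent_snap_trims and that
a value of 0 is accepted and effectively pauses trimming on the release
in use (both assumptions; verify against your version before relying on
this):

    # Pause snaptrim cluster-wide while preparing.
    ceph osd set nosnaptrim

    # Keep every OSD from trimming once the flag is cleared
    # (assumption: 0 disables concurrent snap trims on this release).
    ceph config set osd osd_pg_max_concurrent_snap_trims 0
    ceph config set osd osd_snap_trim_sleep_ssd 1

    # Clear the global flag; nothing should trim yet because of the
    # per-OSD limit set above.
    ceph osd unset nosnaptrim

    # Let a couple of OSDs trim at a time, e.g. osd.0 and osd.1 first.
    ceph config set osd.0 osd_pg_max_concurrent_snap_trims 1
    ceph config set osd.1 osd_pg_max_concurrent_snap_trims 1

    # Watch the snaptrim queue and client latency before enabling more
    # OSDs, then repeat for the next few until the queue drains.
    ceph -s
    ceph pg dump pgs_brief | grep snaptrim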