Hi all,

after some investigation and experiments I found the culprit: snap trim completely trashes performance after the OSD daemons are upgraded to the new version and before the OSDs are converted (bluestore OMAP conversion). A number of users have reported performance problems with snap trim active; it might be a good idea to check whether some unconverted OSDs are still in the cluster. After the OMAP conversion, snap trim behaves normally again and performance is as expected.

For anyone doing periodic snapshot removal and planning to upgrade from mimic to octopus, here is an extended procedure to avoid the problems we experienced. The amendments refer to https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus. All time estimates assume CephFS data with a large number of small files (a really bad case).

Step 2:
# ceph osd set noout
# ceph osd set nosnaptrim

Step 5: Not optional. Before restarting any OSD services:
# ceph config set osd bluestore_fsck_quick_fix_on_mount false

Step 9: Do NOT unset nosnaptrim. This flag needs to remain set until the OMAP conversion has completed. During the conversion period (which can be days to weeks depending on cluster size) it is not possible to trim snapshots. Either one lives with the accumulation of deleted snaps, or one adjusts any rotating snapshot procedure for the conversion window.

Step 13: Check if any OSDs are affected by the pglog dup entries problem:
# ceph tell "osd.*" perf dump | grep -e pglog -e "osd\\."
If any OSD reports osd_pglog_items >> 1M, check https://www.clyso.com/blog/osds-with-unlimited-ram-growth/. It seems important that these dup entries are trimmed before the OMAP conversion; otherwise OSDs might end up in an OOM-kill loop. On our OSDs we had about 800K items and the OSDs stayed within or close to their allocated memory_target. (A small filter for this output is sketched below, after the timing example.)

Step 14: OMAP conversion. Our OSDs had about 5M object shards each and were 70% full on average. The disks are 300G 15K RPM SAS drives, and on these drives the OMAP conversion took approximately 1h per OSD. As far as I can tell, the implemented method is simple: go through every PG, go through every object and create the new RocksDB entries, I believe per PG. With the huge number of small objects we have on our OSDs, IOPS is the limiting factor, so our timings should give a good indication of how long it takes on drives with different specs.

# ceph osd set noout
# # ceph osd set nodown # optional, it can help if things are flaky
# ceph config set osd bluestore_fsck_quick_fix_on_mount true

Now restart the OSDs host by host and allow for recovery in between (a minimal loop for this is also sketched below). In our case the OSDs used about 33% of a core each (on old Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz) and the disks were reasonably utilised. Once the last OSD is converted, the cluster immediately operates much better.

Step 15:
# # ceph osd unset nodown
# ceph osd unset noout
# ceph osd unset nosnaptrim

Expected timing example: For our production cluster I expect a conversion period of at least 12 days. Our OSDs are 7K RPM NLSAS drives and, given the same distribution of file sizes as in the test cluster, I expect a minimum of 9h per drive: 15M object shards per OSD is 3 times more than on the test OSDs, and the drives have roughly 3 times lower performance, so 3 x 3 x 1h = 9h. This is not too bad; our cluster should be able to recover by the next day. Just for info: our biggest OSD nodes have 72 OSDs, we have 12 of these, and I really hope I can convert a complete host at a time without OOM kills.
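The filter mentioned in Step 13 is nothing official, just a rough helper: a minimal sketch that assumes the usual "osd.N:" header line in front of each OSD's dump (the same lines the grep in Step 13 matches) and prints only OSDs whose osd_pglog_items exceed 1M. Adjust the threshold to taste:

# ceph tell "osd.*" perf dump 2>/dev/null \
    | awk '/^osd\.[0-9]+:/      { osd=$1 }
           /"osd_pglog_items":/ { gsub(/[",]/,""); if ($2+0 > 1000000) print osd, $2 }'

If it prints nothing, all OSDs are at most in the few-hundred-K range we saw; anything it does print is worth trimming before starting the conversion.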
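And a minimal sketch of what "restart the OSDs host by host and allow for recovery in between" could look like in practice, assuming non-containerised OSDs managed by systemd, password-less SSH to the OSD hosts, and the flags from Step 14 already set. The host names are placeholders and the wait loop is a crude check on the human-readable output of ceph pg stat (a more careful version would parse ceph status -f json), but it captures the idea: restart all OSDs on one host, then wait until the PGs have settled before moving on:

for host in ceph-01 ceph-02 ceph-03; do              # placeholder host names
    ssh "$host" systemctl restart ceph-osd.target    # restarts all OSDs on the host; conversion runs on mount
    sleep 300                                        # grace period for the OSDs to come up and begin converting
    # wait until no PGs are peering/activating/degraded/recovering any more
    while ceph pg stat | grep -qE 'peering|activating|degraded|recover'; do
        sleep 60
    done
done

If you prefer to go OSD by OSD instead of host by host, systemctl restart ceph-osd@<id> does the same for a single OSD.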
Twelve days is still quite something for a software upgrade, given all the restrictions and vulnerabilities that come with it. I really hope that, should something like this be required again in the future, an online approach is implemented that allows such a conversion to be performed under load, and that all relevant operations are considered in performance evaluation and testing, so that production clusters don't just die on something as ordinary as snap trim.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: 05 September 2022 21:41
To: Frank Schilder; ceph-users@xxxxxxx
Subject: RE: Re: Octopus OSDs extremely slow during upgrade from mimic

Hey Frank,

Did you upgrade directly from mimic to octopus? There is a change in how the OSDs account for OMAP data when upgrading to Octopus. On such upgrades, especially where a large amount of OMAP data is stored on the OSDs, we have seen that it can take quite a long time (a few hours or so) for the OSDs to come back to normal function. Typically we would upgrade a node of OSDs at a time, wait for them to finish, and then move on to the next one.

You can read more in the Ceph docs here: https://docs.ceph.com/en/octopus/releases/octopus/#instructions

> Note that the first time each OSD starts, it will do a format conversion to improve the accounting for “omap” data. This may take a few minutes to as much as a few hours (for an HDD with lots of omap data).

If you check the systemd status of one of these OSDs, or the logs, you should see lots of mentions of this conversion, I believe.

Regards,
Bailey

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx