Hi Mikael,

Thanks for sharing this (see also https://www.stroustrup.com/whitespace98.pdf, python ha ha ha). We would probably have observed the same problem (70+ OSDs per host).

You might want to consider configuring deployment against a local registry and using a patched image. Local container images are always a good idea; post-release patches are common rather than the exception.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mikael Öhman <micketeer@xxxxxxxxx>
Sent: Wednesday, April 5, 2023 1:18 AM
To: ceph-users@xxxxxxx
Subject: Upgrading to 16.2.11 timing out on ceph-volume due to raw list performance bug, downgrade isn't possible due to new OP code in bluestore

Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two big surprises; I wanted to share them in case anyone else encounters the same. I don't see any nice solution to this apart from a new release that fixes the performance regression, which completely breaks the container setup in cephadm due to timeouts.

After some digging, we found that it was the "ceph-volume" command that kept timing out, and after a ton more digging, found that it does so because of
https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333c#diff-29697ff230f01df036802c8b2842648267767b3a7231ea04a402eaf4e1819d29R104
which was introduced in 16.2.11. Unfortunately, the vital fix for this,
https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b
was not included in 16.2.11.

So, in a setup like ours, with *many* devices, a simple "ceph-volume raw list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10). As a result, the service files that cephadm generates

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
ExecStop=-/bin/bash -c '/bin/podman stop ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
ExecStopPost=-/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
Type=forking
PIDFile=%t/%n-pid
Delegate=yes

will repeatedly be marked as failed, as they now take over 2 minutes to start. This tells systemd to restart, and we end up in an infinite loop: since 5 restart attempts take over 50 minutes, the StartLimitBurst of 5 within the 30-minute StartLimitInterval is never reached, leaving this OSD endlessly re-listing the n^2 devices (which, as a bonus, also fills up the root disk with an enormous amount of repeated logging in ceph-volume.log as it tries, over and over, to figure out whether each block device is a bluestore device).

Trying to just fix the service or unit files manually, to at least stop this container from being incorrectly restarted over and over, is also a dead end, since the orchestration just overwrites them automatically and restarts the services again. On my system it seemed to be
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
that generated these files, so I tried tweaking that to use the necessary 1200-second TimeoutStartSec, and that finally managed to get the darn container to stop restarting endlessly.
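In case it helps anyone in the same spot, the tweak was essentially just bumping that timeout in the unit template inside the cephadm copy. A rough sketch of the idea (this assumes the value appears as the literal string "TimeoutStartSec=120" in that file, which I haven't double-checked; verify first, and note the orchestrator may replace this copy again later):

  # sketch only: bump the start timeout in the per-fsid cephadm copy that
  # generates the unit files (assumes "TimeoutStartSec=120" appears literally)
  sed -i 's/TimeoutStartSec=120/TimeoutStartSec=1200/' \
      /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2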
(I admit I'm very fuzzy on how these services and the orchestration are triggered, as I usually don't work on our storage stuff.) Still, it now takes 11 minutes to start each OSD service, so this isn't great.

We wanted to revert back to 16.2.10, but that turns out to also be a no-go, as a new operation was added to bluefs in 16.2.11 (https://github.com/ceph/ceph/pull/42750). This isn't mentioned in the changelogs; I had to compare the source code to see that it was in fact added in 16.2.11. So trying to revert an OSD then fails with:

debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs _replay 0x100000: stop: unrecognized op 12
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs mount failed to replay log: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db environment:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200  1 bdev(0x5590e80a0400 /var/lib/ceph/osd/ceph-10/block) close
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 osd.10 0 OSD:init: unable to mount object store
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 ** ERROR: osd init failed: (5) Input/output error

Ouch.

Best regards,
Mikael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx