Thank you for the suggestion, Frank. We've managed to avoid patches so far,
but I guess that convenience ends now :(

With

# lsblk -P -p -o 'NAME' | wc -l
137

it takes about 10 minutes to run. 70 devices would probably also put you over
the 2-minute timeout window, so I certainly wouldn't consider updating unless
you have this bug patched.

Best regards, Mikael

On Wed, Apr 5, 2023 at 9:35 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi Mikael, thanks for sharing this (see also
> https://www.stroustrup.com/whitespace98.pdf, python ha ha ha). We would
> probably have observed the same problem (70+ OSDs per host). You might want
> to consider configuring deployment against a local registry and using a
> patched image. Local container images are always a good idea; post-release
> patches are common, not an exception.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Mikael Öhman <micketeer@xxxxxxxxx>
> Sent: Wednesday, April 5, 2023 1:18 AM
> To: ceph-users@xxxxxxx
> Subject: Upgrading to 16.2.11 timing out on ceph-volume due to raw list
> performance bug, downgrade isn't possible due to new OP code in bluestore
>
> Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two
> big surprises; I wanted to share them in case anyone else encounters the
> same. I don't see any nice solution to this apart from a new release that
> fixes the performance regression, which completely breaks the container
> setup in cephadm due to timeouts.
>
> After some digging, we found that it was the "ceph-volume" command that
> kept timing out, and after a ton more digging, found that it does so
> because of
> https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333c#diff-29697ff230f01df036802c8b2842648267767b3a7231ea04a402eaf4e1819d29R104
> which was introduced in 16.2.11.
> Unfortunately, the vital fix for this,
> https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b
> was not included in 16.2.11.
>
> So, in a setup like ours, with *many* devices, a simple "ceph-volume raw
> list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10);
> a rough sketch of why is further down. As a result, the services behind the
> unit files that cephadm generates,
>
> [Service]
> LimitNOFILE=1048576
> LimitNPROC=1048576
> EnvironmentFile=-/etc/environment
> ExecStart=/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
> ExecStop=-/bin/bash -c '/bin/podman stop ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
> ExecStopPost=-/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
> KillMode=none
> Restart=on-failure
> RestartSec=10s
> TimeoutStartSec=120
> TimeoutStopSec=120
> StartLimitInterval=30min
> StartLimitBurst=5
> ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
> ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
> Type=forking
> PIDFile=%t/%n-pid
> Delegate=yes
>
> will repeatedly be marked as failed, as they now take over 2 minutes to
> start.
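>
> To give a feel for why the runtime explodes with the device count: as far
> as I can tell, the regressed per-device check ends up scanning every device
> again, shelling out to an external tool at each step. Here is a rough
> sketch of that shape, not the actual ceph-volume code (the blkid call is
> just a stand-in for whatever it really runs per device):
>
> import subprocess
>
> def list_devices():
>     # enumerate all block devices in one lsblk call
>     out = subprocess.check_output(['lsblk', '-P', '-p', '-o', 'NAME'], text=True)
>     return [line.split('"')[1] for line in out.splitlines() if line]
>
> def looks_like_bluestore(dev, all_devices):
>     # hypothetical per-device check that itself walks every device again,
>     # spawning one external process per step
>     for other in all_devices:
>         subprocess.run(['blkid', other], capture_output=True)
>     return False
>
> devices = list_devices()
> for dev in devices:                    # n devices ...
>     looks_like_bluestore(dev, devices) # ... times n checks each = n^2 subprocesses
>
> With this many devices that quickly becomes tens of thousands of subprocess
> calls; at a few tens of milliseconds each, the 10+ minutes add up.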
>
> These repeated timeouts tell systemd to restart the service, and we now
> have an infinite loop: since the 5 restarts take over 50 minutes, the
> StartLimitInterval/StartLimitBurst rate limit never even kicks in, leaving
> this OSD in an endless loop over the n^2 device listing (which, as a bonus,
> also fills up the root disk with an enormous amount of repeated logging in
> ceph-volume.log as it endlessly tries to figure out whether each block
> device is a bluestore device).
>
> Trying to fix the service or unit files manually, just to stop this
> container from being incorrectly restarted over and over, is also a dead
> end, since the orchestration automatically overwrites them and restarts
> the services again.
> It seemed to be
> /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
> on my system that generated these files, so I tried tweaking that to use
> the necessary 1200-second TimeoutStartSec, and that finally got the darn
> container to stop restarting endlessly. (I admit I'm very fuzzy on how
> these services and the orchestration are triggered, as I usually don't
> work on our storage stuff.)
> Still, it takes 11 minutes to start each OSD service now, so this isn't
> great.
>
> We wanted to revert back to 16.2.10, but that turns out to be a no-go as
> well, because a new operation was added to bluefs in 16.2.11 by
> https://github.com/ceph/ceph/pull/42750 (this isn't mentioned in the
> changelogs; I had to compare the source code to see that it was in fact
> added in 16.2.11). So trying to revert an OSD fails with:
>
> debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs _replay 0x100000: stop: unrecognized op 12
> debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs mount failed to replay log: (5) Input/output error
> debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5) Input/output error
> debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db environment:
> debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 1 bdev(0x5590e80a0400 /var/lib/ceph/osd/ceph-10/block) close
> debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 osd.10 0 OSD:init: unable to mount object store
> debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 ** ERROR: osd init failed: (5) Input/output error
>
> Ouch.
>
> Best regards, Mikael
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx