Hi Mikael,

Thanks for sharing this (see also https://www.stroustrup.com/whitespace98.pdf, python ha ha ha). We would probably have observed the same problem (70+ OSDs per host).

You might want to consider configuring deployment against a local registry and using a patched image. Local container images are always a good idea; post-release patches are common rather than the exception.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mikael Öhman <micketeer@xxxxxxxxx>
Sent: Wednesday, April 5, 2023 1:18 AM
To: ceph-users@xxxxxxx
Subject: Upgrading to 16.2.11 timing out on ceph-volume due to raw list performance bug, downgrade isn't possible due to new OP code in bluestore

Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two big surprises; I wanted to share them in case anyone else encounters the same. I don't see any nice solution to this apart from a new release that fixes the performance regression, which completely breaks the container setup in cephadm due to timeouts.

After some digging, we found that it was the "ceph-volume" command that kept timing out, and after a ton more digging, found that it does so because of
https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333c#diff-29697ff230f01df036802c8b2842648267767b3a7231ea04a402eaf4e1819d29R104
which was introduced in 16.2.11. Unfortunately, the vital fix for this,
https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b
was not included in 16.2.11.

So, in a setup like ours, with *many* devices, a simple "ceph-volume raw list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10). As a result, the service files that cephadm generates

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
ExecStop=-/bin/bash -c '/bin/podman stop ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
ExecStopPost=-/bin/bash /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
Type=forking
PIDFile=%t/%n-pid
Delegate=yes

will repeatedly be marked as failed, as they now take over 2 minutes to start. This tells systemd to restart, and we end up in an infinite loop: since 5 restart attempts take over 50 minutes, the StartLimitBurst of 5 within the 30-minute StartLimitInterval is never reached, leaving this OSD endlessly re-listing the n^2 devices (which, as a bonus, also fills up the root disk with an enormous amount of repeated logging in ceph-volume.log as it tries, over and over, to figure out whether each block device is a bluestore device).

Trying to just fix the service or unit files manually, to at least stop this container from being incorrectly restarted over and over, is also a dead end, since the orchestration just overwrites them automatically and restarts the services again. On my system it seemed to be
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
that generated these files, so I tried tweaking that to use the necessary 1200-second TimeoutStartSec, and that finally managed to get the darn container to stop restarting endlessly.
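In case it helps anyone in the same spot, the tweak was essentially just bumping that timeout in the unit template inside the cephadm copy. A rough sketch of the idea (this assumes the value appears as the literal string "TimeoutStartSec=120" in that file, which I haven't double-checked; verify first, and note the orchestrator may replace this copy again later):

  # sketch only: bump the start timeout in the per-fsid cephadm copy that
  # generates the unit files (assumes "TimeoutStartSec=120" appears literally)
  sed -i 's/TimeoutStartSec=120/TimeoutStartSec=1200/' \
      /var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2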
(I admit I'm very fuzzy on how these services and the orchestration are triggered, as I usually don't work on our storage stuff.) Still, it now takes 11 minutes to start each OSD service, so this isn't great.

We wanted to revert back to 16.2.10, but that turns out to also be a no-go, as a new operation was added to bluefs in 16.2.11 (https://github.com/ceph/ceph/pull/42750). This isn't mentioned in the changelogs; I had to compare the source code to see that it was in fact added in 16.2.11. So trying to revert an OSD then fails with:

debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs _replay 0x100000: stop: unrecognized op 12
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs mount failed to replay log: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db environment:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200  1 bdev(0x5590e80a0400 /var/lib/ceph/osd/ceph-10/block) close
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 osd.10 0 OSD:init: unable to mount object store
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 ** ERROR: osd init failed: (5) Input/output error

Ouch.

Best regards,
Mikael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx