Hi,

If I may, I would try something like this, but I haven't tested it, so please take it with a grain of salt...

1. Reinstall the operating system. Since the root filesystem is accessible but the OS is not bootable, the most straightforward approach would be to perform a clean install of Ubuntu 20.04 on the corrupted server. Make sure not to format the data drives used by Ceph (the OSDs here) during this process.

2. Reinstall Ceph. After reinstalling the operating system, you'll need to reinstall Ceph. Ensure that the version you install is compatible with the existing cluster (v15.2.17 as per your setup).

3. Reattach the OSDs:
- Using ceph-bluestore-tool show-label to identify the correct devices is a good start. You've already identified the devices, which is crucial in this case.
- You attempted to use "cephadm ceph-volume raw activate" but encountered issues due to the missing osd_id in the metadata and systemd support not being implemented for this command in your environment. This suggests there might be an inconsistency or corruption in the metadata of the OSDs, but I am not sure... maybe a dev could help here.

Handling the osd_id KeyError:
- The KeyError 'osd_id' indicates, I think, that the metadata required to map the OSD ID to the device is missing (maybe corruption?). If possible, check the output of "ceph-bluestore-tool show-label --dev /dev/your_disk_drive_here" on each of the devices to verify whether osd_id is present in any metadata there - see the example commands after this list. If it is consistently missing, that metadata might need to be recreated or repopulated (I don't know how to do this part).
- For OSDs where osd_id is available, try activating them individually using the "cephadm ceph-volume raw activate" command with the --no-systemd flag.

4. Re-add the host to the cluster if it appears offline. Once the system is operational and the Ceph services are running, you can add the host back to the cluster using the "ceph orch host add <hostname>" command - I am not sure this step is needed.

5. Make sure the hostname matches what the cluster expects, and the networking is correctly configured - as similar to the old config as possible.
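For example, something along these lines might help to see which labels actually contain an OSD id (untested, just a sketch - the /dev/sd?2 glob and the device/id/uuid values below are simply the ones from your mail, so adjust them to your layout; if I remember correctly, the OSD id normally shows up as the "whoami" field in the label of the main block device):

# print the identifying fields from the bluestore label of each candidate data partition
for dev in /dev/sd?2; do
    echo "== $dev =="
    ceph-bluestore-tool show-label --dev "$dev" | grep -E '"whoami"|"osd_id"|"osd_uuid"'
done

# for a device whose label matches one of the down OSDs, retry the activation
# without systemd, e.g. with the values from your mail:
cephadm ceph-volume raw activate --device /dev/sda --osd-id 20 \
  --osd-uuid 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef \
  --block.db /dev/nvme1n1p1 --block.wal /dev/nvme0n1p1 --no-systemd

If the labels look fine but activation still doesn't seem to change anything, the "ceph cephadm osd activate <host>" command from the docs that Eugen mentioned below might also be worth a try once the host is back in the cluster.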
Just my 2 cents.

Thank you,
Bogdan Velica
croit.io

On Tue, Apr 30, 2024 at 10:38 PM Peter van Heusden <pvh@xxxxxxxxxxx> wrote:
> Thanks Eugen and others for the advice. These are not, however, lvm-based
> OSDs. I can get a list of what is out there with:
>
> cephadm ceph-volume raw list
>
> and tried
>
> cephadm ceph-volume raw activate
>
> but it tells me I need to manually run activate.
>
> I was able to find the correct data disks with for example:
>
> ceph-bluestore-tool show-label --dev /dev/sda2
>
> but on running e.g.
>
> cephadm ceph-volume raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1
>
> (OSD ID inferred from the list of down OSDs)
>
> I got an error that "systemd support not yet implemented". On adding
> --no-systemd to the command, I get the response:
>
> stderr KeyError: 'osd_id'
> "
> The on-disk metadata indeed doesn't have an osd_id for most entries. For
> the one instance I can find with the osd_id key in the metadata, the
> "cephadm ceph-volume raw activate" completes but with no apparent change
> to the system.
>
> Is there any advice on how to recover the configuration with raw, not LVM,
> OSDs?
>
> And then once I have things added back in: the host is currently listed
> as offline in the output of "ceph orch host ls". How can it be re-added
> to this list?
>
> Thank you,
> Peter
>
> BTW full error message:
>
> Inferring fsid ed7b2c16-b053-45e2-a1fe-bf3474f90508
> Using ceph image with id '59248721b0c7' and tag 'v17' created on
> 2024-04-24 16:06:51 +0000 UTC
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
> /usr/sbin/ceph-volume --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> -e NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
> /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> /tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1 --no-systemd
> /usr/bin/docker: stderr Traceback (most recent call last):
> /usr/bin/docker: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
> /usr/bin/docker: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
> /usr/bin/docker: stderr     self.main(self.argv)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
> /usr/bin/docker: stderr     return f(*a, **kw)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
> /usr/bin/docker: stderr     terminal.dispatch(self.mapper, subcommand_args)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
> /usr/bin/docker: stderr     instance.main()
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
> /usr/bin/docker: stderr     terminal.dispatch(self.mapper, self.argv)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
> /usr/bin/docker: stderr     instance.main()
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 166, in main
> /usr/bin/docker: stderr     systemd=not self.args.no_systemd)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
> /usr/bin/docker: stderr     return func(*a, **kw)
> /usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 79, in activate
> /usr/bin/docker: stderr     osd_id = meta['osd_id']
> /usr/bin/docker: stderr KeyError: 'osd_id'
> Traceback (most recent call last):
>   File "/usr/sbin/cephadm", line 9679, in <module>
>     main()
>   File "/usr/sbin/cephadm", line 9667, in main
>     r = ctx.func(ctx)
>   File "/usr/sbin/cephadm", line 2116, in _infer_config
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2061, in _infer_fsid
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2144, in _infer_image
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2019, in _validate_fsid
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 6272, in command_ceph_volume
>     out, err, code = call_throws(ctx, c.run_cmd(),
> verbosity=CallVerbosity.QUIET_UNLESS_ERROR)
>   File "/usr/sbin/cephadm", line 1807, in call_throws
>     raise RuntimeError('Failed command: %s' % ' '.join(command))
> RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
> /usr/sbin/ceph-volume --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> -e NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
> /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> /tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1 --no-systemd
>
>
> On Wed, 24 Apr 2024 at 14:47, Eugen Block <eblock@xxxxxx> wrote:
>
> > In addition to Nico's response, three years ago I wrote a blog post
> > [1] about that topic, maybe that can help as well. It might be a bit
> > outdated, what it definitely doesn't contain is this command from the
> > docs [2] once the server has been re-added to the host list:
> >
> > ceph cephadm osd activate <host>
> >
> > Regards,
> > Eugen
> >
> > [1]
> > https://heiterbiswolkig.blogs.nde.ag/2021/02/08/cephadm-reusing-osds-on-reinstalled-server/
> > [2]
> > https://docs.ceph.com/en/latest/cephadm/services/osd/#activate-existing-osds
> >
> > Zitat von Nico Schottelius <nico.schottelius@xxxxxxxxxxx>:
> >
> > > Hey Peter,
> > >
> > > the /var/lib/ceph directories mainly contain "meta data" that, depending
> > > on the ceph version and osd setup, can even be residing on tmpfs by
> > > default.
> > >
> > > Even if the data was on-disk, they are easy to recreate:
> > >
> > > --------------------------------------------------------------------------------
> > > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]# ls -l
> > > total 28
> > > lrwxrwxrwx 1 ceph ceph 8 Feb 7 12:12 block -> /dev/sde
> > > -rw------- 1 ceph ceph 37 Feb 7 12:12 ceph_fsid
> > > -rw------- 1 ceph ceph 37 Feb 7 12:12 fsid
> > > -rw------- 1 ceph ceph 56 Feb 7 12:12 keyring
> > > -rw------- 1 ceph ceph 6 Feb 7 12:12 ready
> > > -rw------- 1 ceph ceph 3 Feb 7 12:12 require_osd_release
> > > -rw------- 1 ceph ceph 10 Feb 7 12:12 type
> > > -rw------- 1 ceph ceph 3 Feb 7 12:12 whoami
> > > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]#
> > > --------------------------------------------------------------------------------
> > >
> > > We used to create OSDs manually on alpine linux some years ago using
> > > [0], you can check it out as an inspiration for what should be in which
> > > file.
> > >
> > > BR,
> > >
> > > Nico
> > >
> > > [0]
> > > https://code.ungleich.ch/ungleich-public/ungleich-tools/src/branch/master/ceph/ceph-osd-create-start-alpine
> > >
> > > Peter van Heusden <pvh@xxxxxxxxxxx> writes:
> > >
> > >> Dear Ceph Community
> > >>
> > >> We have 5 OSD servers running Ceph v15.2.17. The host operating system is
> > >> Ubuntu 20.04.
> > >>
> > >> One of the servers has suffered corruption to its boot operating system.
> > >> Using a system rescue disk it is possible to mount the root filesystem but
> > >> it is not possible to boot the operating system at the moment.
> > >>
> > >> The OSDs are configured with (spinning disk) data drives, WALs and DBs on
> > >> partitions of SSDs, but from my examination of the filesystem the
> > >> configuration in /var/lib/ceph appears to be corrupted.
> > >>
> > >> So my question is: what is the best option for repair going forward? Is it
> > >> possible to do a clean install of the operating system and scan the
> > >> existing drives in order to reconstruct the OSD configuration?
> > >>
> > >> Thank you,
> > >> Peter
> > >> P.S. the cause of the original corruption is likely due to an unplanned
> > >> power outage, an event that hopefully will not recur.
> > >> _______________________________________________
> > >> ceph-users mailing list -- ceph-users@xxxxxxx
> > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx