Re: Reconstructing an OSD server when the boot OS is corrupted

Hi,

If I may, I would try something like the following, but I haven't tested
this, so please take it with a grain of salt...

1. Reinstall the operating system.
Since the root filesystem is accessible but the OS is not bootable, the
most straightforward approach would be a clean install of Ubuntu 20.04 on
the corrupted server. Make sure not to format the data drives used by Ceph
(the OSDs) during this process.
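For example, before wiping anything you could take an inventory of the
disks from the rescue environment (untested sketch; the device names are
only examples):

```shell
# Untested sketch: list the block devices before reinstalling so the Ceph
# data drives are left alone by the installer. Device names are examples.
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

# show-label succeeds only on devices carrying a BlueStore label, so this
# flags the partitions that must not be formatted.
for dev in /dev/sda2 /dev/sdb2; do
    ceph-bluestore-tool show-label --dev "$dev" >/dev/null 2>&1 \
        && echo "$dev: BlueStore OSD - do not format" \
        || true
done
```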

2. Reinstall Ceph.
After reinstalling the operating system, you'll need to reinstall Ceph.
Ensure that the version you install is compatible with the existing cluster
(v15.2.17 as per your setup).
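On a stock Ubuntu 20.04 host this might be as simple as the following
(untested; focal's own repositories carry Ceph 15.2.x, but use the
packages from download.ceph.com if that is how the cluster was originally
installed):

```shell
# Untested sketch: reinstall the container runtime and the Ceph tooling on
# the rebuilt Ubuntu 20.04 host. Package names assume the stock focal
# repositories, which ship Ceph 15.2.x.
sudo apt-get update
sudo apt-get install -y docker.io cephadm ceph-osd  # ceph-osd provides ceph-bluestore-tool
cephadm version   # check the result is compatible with the v15.2.17 cluster
```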

3. Reattach the OSDs:
- Using ceph-bluestore-tool show-label to identify the correct devices is a
good start. You've already identified the devices, which is crucial in this
case.
- You attempted cephadm ceph-volume raw activate but ran into issues due to
the missing osd_id in the metadata and systemd support not being
implemented for this command in your environment.
This suggests there might be an inconsistency or corruption in the OSD
metadata, but I am not sure... maybe a dev could help here.

Handling the osd_id KeyError:
- The KeyError: 'osd_id' indicates, I think, that the metadata required to
map the OSD ID to the device is missing (perhaps due to corruption?). If
possible, check the output of "ceph-bluestore-tool show-label --dev
/dev/your_disk_drive_here" for each device to verify whether osd_id is
present in the metadata. If it is consistently missing, the metadata may
need to be recreated or repopulated (I don't know how to do this part).
- For OSDs where osd_id is available, try activating them individually
using the "cephadm ceph-volume raw activate" command with the *--no-systemd*
flag.
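A quick way to check every device at once might be something like this
(again untested; the device names are examples, and it does nothing more
than grep the show-label JSON for an osd_id key):

```shell
# Untested sketch: scan candidate partitions and report whether each
# BlueStore label contains an osd_id. Adjust the device list to your layout.

has_osd_id() {
    # Succeed if the show-label JSON on stdin contains an osd_id key.
    grep -q '"osd_id"'
}

for dev in /dev/sda2 /dev/sdb2 /dev/sdc2; do
    # show-label prints the BlueStore label as JSON; skip non-OSD devices
    label=$(ceph-bluestore-tool show-label --dev "$dev" 2>/dev/null) || continue
    if printf '%s' "$label" | has_osd_id; then
        echo "$dev: label has osd_id"
    else
        echo "$dev: label is MISSING osd_id"
    fi
done
```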

4. Re-add the host to the cluster if it appears offline.
If the host appears offline once the system is operational and the Ceph
services are running, you can add it back to the cluster with the "ceph
orch host add <hostname>" command - I am not sure this step is needed.

5. Make sure the hostname matches what the cluster expects and that the
networking is configured as before, matching the old setup where possible.
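If the re-add does turn out to be needed, the sequence might look like
this (untested; "ceph-osd3" is the hostname taken from your error output,
and "ceph cephadm osd activate" is the documented cephadm command for
adopting existing OSDs on a re-added host):

```shell
# Untested sketch: re-register the rebuilt host with the orchestrator and
# let cephadm activate the OSDs it finds on disk. "ceph-osd3" is the
# hostname from the error output; substitute your own.
ceph orch host ls                     # the host will likely still show as offline
ceph orch host add ceph-osd3          # re-add the host to the cluster
ceph cephadm osd activate ceph-osd3   # scan its disks and activate existing OSDs
```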

Just my 2 cents.

Thank you,
Bogdan Velica
croit.io

On Tue, Apr 30, 2024 at 10:38 PM Peter van Heusden <pvh@xxxxxxxxxxx> wrote:

> Thanks Eugen and others for the advice. These are not, however, lvm-based
> OSDs. I can get a list of what is out there with:
>
> cephadm ceph-volume raw list
>
> and tried
>
> cephadm ceph-volume raw activate
>
> but it tells me I need to manually run activate.
>
> I was able to find the correct data disks with for example:
>
> ceph-bluestore-tool show-label --dev /dev/sda2
>
> but on running e.g.
>
> cephadm ceph-volume raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1
>
> (OSD ID inferred from the list of down OSDs)
>
> I got an error that "systemd support not yet implemented". On adding
> --no-systemd to the command, I get the response:
>
> stderr KeyError: 'osd_id'
> The on-disk metadata indeed doesn't have an osd_id for most entries. For
> the one instance I can find with the osd_id key in the metadata, the
> "cephadm ceph-volume raw activate" completes but with no apparent change to
> the system.
>
> Is there any advice on how to recover the configuration with raw, not LVM,
> OSDs?
>
> And then once I have things added back in: the host is currently listed as
> offline in the output of "ceph orch host ls". How can it be re-added to
> this list?
>
> Thank you,
> Peter
>
> BTW full error message:
>
> Inferring fsid ed7b2c16-b053-45e2-a1fe-bf3474f90508
> Using ceph image with id '59248721b0c7' and tag 'v17' created on 2024-04-24
> 16:06:51 +0000 UTC
>
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
> /usr/sbin/ceph-volume --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
>
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> -e
> NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
> /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> /tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
>
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1 --no-systemd
> /usr/bin/docker: stderr Traceback (most recent call last):
> /usr/bin/docker: stderr   File "/usr/sbin/ceph-volume", line 11, in
> <module>
> /usr/bin/docker: stderr     load_entry_point('ceph-volume==1.0.0',
> 'console_scripts', 'ceph-volume')()
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in
> __init__
> /usr/bin/docker: stderr     self.main(self.argv)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in
> newfunc
> /usr/bin/docker: stderr     return f(*a, **kw)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
> /usr/bin/docker: stderr     terminal.dispatch(self.mapper, subcommand_args)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in
> dispatch
> /usr/bin/docker: stderr     instance.main()
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line
> 32, in main
> /usr/bin/docker: stderr     terminal.dispatch(self.mapper, self.argv)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in
> dispatch
> /usr/bin/docker: stderr     instance.main()
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py",
> line 166, in main
> /usr/bin/docker: stderr     systemd=not self.args.no_systemd)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in
> is_root
> /usr/bin/docker: stderr     return func(*a, **kw)
> /usr/bin/docker: stderr   File
> "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py",
> line 79, in activate
> /usr/bin/docker: stderr     osd_id = meta['osd_id']
> /usr/bin/docker: stderr KeyError: 'osd_id'
> Traceback (most recent call last):
>   File "/usr/sbin/cephadm", line 9679, in <module>
>     main()
>   File "/usr/sbin/cephadm", line 9667, in main
>     r = ctx.func(ctx)
>   File "/usr/sbin/cephadm", line 2116, in _infer_config
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2061, in _infer_fsid
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2144, in _infer_image
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 2019, in _validate_fsid
>     return func(ctx)
>   File "/usr/sbin/cephadm", line 6272, in command_ceph_volume
>     out, err, code = call_throws(ctx, c.run_cmd(),
> verbosity=CallVerbosity.QUIET_UNLESS_ERROR)
>   File "/usr/sbin/cephadm", line 1807, in call_throws
>     raise RuntimeError('Failed command: %s' % ' '.join(command))
> RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
> /usr/sbin/ceph-volume --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
>
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> -e
> NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
> /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> /tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
>
> quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
> raw activate --osd-id 20 --device /dev/sda --osd-uuid
> 74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
> /dev/nvme0n1p1 --no-systemd
>
>
>
>
>
>
> On Wed, 24 Apr 2024 at 14:47, Eugen Block <eblock@xxxxxx> wrote:
>
> > In addition to Nico's response, three years ago I wrote a blog post
> > [1] about that topic, maybe that can help as well. It might be a bit
> > outdated, what it definitely doesn't contain is this command from the
> > docs [2] once the server has been re-added to the host list:
> >
> > ceph cephadm osd activate <host>
> >
> > Regards,
> > Eugen
> >
> > [1]
> >
> >
> https://heiterbiswolkig.blogs.nde.ag/2021/02/08/cephadm-reusing-osds-on-reinstalled-server/
> > [2]
> >
> >
> https://docs.ceph.com/en/latest/cephadm/services/osd/#activate-existing-osds
> >
> > Zitat von Nico Schottelius <nico.schottelius@xxxxxxxxxxx>:
> >
> > > Hey Peter,
> > >
> > > the /var/lib/ceph directories mainly contain "meta data" that,
> depending
> > > on the ceph version and osd setup, can even be residing on tmpfs by
> > > default.
> > >
> > > Even if the data was on-disk, they are easy to recreate:
> > >
> > >
> >
> --------------------------------------------------------------------------------
> > > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]# ls -l
> > > total 28
> > > lrwxrwxrwx 1 ceph ceph  8 Feb  7 12:12 block -> /dev/sde
> > > -rw------- 1 ceph ceph 37 Feb  7 12:12 ceph_fsid
> > > -rw------- 1 ceph ceph 37 Feb  7 12:12 fsid
> > > -rw------- 1 ceph ceph 56 Feb  7 12:12 keyring
> > > -rw------- 1 ceph ceph  6 Feb  7 12:12 ready
> > > -rw------- 1 ceph ceph  3 Feb  7 12:12 require_osd_release
> > > -rw------- 1 ceph ceph 10 Feb  7 12:12 type
> > > -rw------- 1 ceph ceph  3 Feb  7 12:12 whoami
> > > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]#
> > >
> >
> --------------------------------------------------------------------------------
> > >
> > > We used to create OSDs manually on alpine linux some years ago using
> > > [0], you can check it out as an inspiration for what should be in which
> > > file.
> > >
> > > BR,
> > >
> > > Nico
> > >
> > >
> > > [0]
> > >
> >
> https://code.ungleich.ch/ungleich-public/ungleich-tools/src/branch/master/ceph/ceph-osd-create-start-alpine
> > >
> > > Peter van Heusden <pvh@xxxxxxxxxxx> writes:
> > >
> > >> Dear Ceph Community
> > >>
> > >> We have 5 OSD servers running Ceph v15.2.17. The host operating system
> > is
> > >> Ubuntu 20.04.
> > >>
> > >> One of the servers has suffered corruption to its boot operating
> system.
> > >> Using a system rescue disk it is possible to mount the root filesystem
> > but
> > >> it is not possible to boot the operating system at the moment.
> > >>
> > >> The OSDs are configured with (spinning disk) data drives, WALs and DBs
> > on
> > >> partitions of SSDs, but from my examination of the filesystem the
> > >> configuration in /var/lib/ceph appears to be corrupted.
> > >>
> > >> So my question is: what is the best option for repair going forward?
> Is
> > it
> > >> possible to do a clean install of the operating system and scan the
> > >> existing drives in order to reconstruct the OSD configuration?
> > >>
> > >> Thank you,
> > >> Peter
> > >> P.S. the cause of the original corruption is likely due to an
> unplanned
> > >> power outage, an event that hopefully will not recur.
> > >> _______________________________________________
> > >> ceph-users mailing list -- ceph-users@xxxxxxx
> > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> >
>



