Re: Reconstructing an OSD server when the boot OS is corrupted

Thanks Eugen and others for the advice. These are not, however, LVM-based
OSDs. I can get a list of what is out there with:

cephadm ceph-volume raw list

and I tried

cephadm ceph-volume raw activate

but it tells me I need to manually run activate.

I was able to find the correct data disks with, for example:

ceph-bluestore-tool show-label --dev /dev/sda2
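
In case it is useful to anyone else, I went through the candidate
partitions with a small loop along these lines (the globs match our
layout, so adjust as needed):

for dev in /dev/sd?2 /dev/nvme*p*; do   # adjust globs to your device layout
    echo "== ${dev} =="
    ceph-bluestore-tool show-label --dev "${dev}" 2>/dev/null
done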

but on running e.g.

cephadm ceph-volume raw activate --osd-id 20 --device /dev/sda --osd-uuid
74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
/dev/nvme0n1p1

(OSD ID inferred from the list of down OSDs)

I got an error that "systemd support not yet implemented". On adding
--no-systemd to the command, I got the response:

stderr KeyError: 'osd_id'

The on-disk metadata indeed doesn't have an osd_id for most entries. For
the one instance I can find with the osd_id key in the metadata, the
"cephadm ceph-volume raw activate" completes but with no apparent change to
the system.
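
One idea I am considering but have not dared to try: diffing the label of
the one device that does have the key against one that does not, roughly

diff <(ceph-bluestore-tool show-label --dev /dev/sdX2) \
     <(ceph-bluestore-tool show-label --dev /dev/sda2)

(with /dev/sdX2 standing in for the working device), and then, if it really
is just a missing label key, writing it back with something like

ceph-bluestore-tool set-label-key --dev /dev/sda2 -k <missing-key> -v <value>   # key/value are placeholders

but I would want confirmation that this is safe before touching the labels.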

Is there any advice on how to recover the configuration with raw (non-LVM)
OSDs?

And then, once I have the OSDs added back in: the host is currently listed
as offline in the output of "ceph orch host ls". How can it be re-added to
this list?
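
(I assume it is something along the lines of

ceph orch host add ceph-osd3 <host-ip>   # <host-ip> is a placeholder

once the reinstalled host is reachable over SSH again with the cluster's
public key in place, but I am not sure whether that is the right procedure
for a host that was previously managed, or whether the orchestrator will
try to redeploy daemons before the OSDs are sorted out.)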

Thank you,
Peter

BTW, the full error message:

Inferring fsid ed7b2c16-b053-45e2-a1fe-bf3474f90508
Using ceph image with id '59248721b0c7' and tag 'v17' created on 2024-04-24
16:06:51 +0000 UTC
quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
/usr/sbin/ceph-volume --privileged --group-add=disk --init -e
CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
-e NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
/var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
/dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
/tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
raw activate --osd-id 20 --device /dev/sda --osd-uuid
74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
/dev/nvme0n1p1 --no-systemd
/usr/bin/docker: stderr Traceback (most recent call last):
/usr/bin/docker: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
/usr/bin/docker: stderr     load_entry_point('ceph-volume==1.0.0',
'console_scripts', 'ceph-volume')()
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
/usr/bin/docker: stderr     self.main(self.argv)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in
newfunc
/usr/bin/docker: stderr     return f(*a, **kw)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
/usr/bin/docker: stderr     terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in
dispatch
/usr/bin/docker: stderr     instance.main()
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line
32, in main
/usr/bin/docker: stderr     terminal.dispatch(self.mapper, self.argv)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in
dispatch
/usr/bin/docker: stderr     instance.main()
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py",
line 166, in main
/usr/bin/docker: stderr     systemd=not self.args.no_systemd)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in
is_root
/usr/bin/docker: stderr     return func(*a, **kw)
/usr/bin/docker: stderr   File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py",
line 79, in activate
/usr/bin/docker: stderr     osd_id = meta['osd_id']
/usr/bin/docker: stderr KeyError: 'osd_id'
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 9679, in <module>
    main()
  File "/usr/sbin/cephadm", line 9667, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 2116, in _infer_config
    return func(ctx)
  File "/usr/sbin/cephadm", line 2061, in _infer_fsid
    return func(ctx)
  File "/usr/sbin/cephadm", line 2144, in _infer_image
    return func(ctx)
  File "/usr/sbin/cephadm", line 2019, in _validate_fsid
    return func(ctx)
  File "/usr/sbin/cephadm", line 6272, in command_ceph_volume
    out, err, code = call_throws(ctx, c.run_cmd(),
verbosity=CallVerbosity.QUIET_UNLESS_ERROR)
  File "/usr/sbin/cephadm", line 1807, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
/usr/sbin/ceph-volume --privileged --group-add=disk --init -e
CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
-e NODE_NAME=ceph-osd3 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
/var/log/ceph/ed7b2c16-b053-45e2-a1fe-bf3474f90508:/var/log/ceph:z -v
/dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
/tmp/ceph-tmpjox0_hj0:/etc/ceph/ceph.conf:z
quay.io/ceph/ceph@sha256:96f2a53bc3028eec16e790c6225e7d7acad8a48737a57ec14eea7ce036733233
raw activate --osd-id 20 --device /dev/sda --osd-uuid
74f4ce9c-4623-41b7-a7f9-cc81bb9467ef --block.db /dev/nvme1n1p1 --block.wal
/dev/nvme0n1p1 --no-systemd






On Wed, 24 Apr 2024 at 14:47, Eugen Block <eblock@xxxxxx> wrote:

> In addition to Nico's response, three years ago I wrote a blog post
> [1] about that topic; maybe that can help as well. It might be a bit
> outdated; what it definitely doesn't contain is this command from the
> docs [2], to be run once the server has been re-added to the host list:
>
> ceph cephadm osd activate <host>
>
> Regards,
> Eugen
>
> [1] https://heiterbiswolkig.blogs.nde.ag/2021/02/08/cephadm-reusing-osds-on-reinstalled-server/
> [2] https://docs.ceph.com/en/latest/cephadm/services/osd/#activate-existing-osds
>
> Zitat von Nico Schottelius <nico.schottelius@xxxxxxxxxxx>:
>
> > Hey Peter,
> >
> > the /var/lib/ceph directories mainly contain "meta data" that, depending
> > on the ceph version and osd setup, can even be residing on tmpfs by
> > default.
> >
> > Even if the data was on-disk, they are easy to recreate:
> >
> >
> > --------------------------------------------------------------------------------
> > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]# ls -l
> > total 28
> > lrwxrwxrwx 1 ceph ceph  8 Feb  7 12:12 block -> /dev/sde
> > -rw------- 1 ceph ceph 37 Feb  7 12:12 ceph_fsid
> > -rw------- 1 ceph ceph 37 Feb  7 12:12 fsid
> > -rw------- 1 ceph ceph 56 Feb  7 12:12 keyring
> > -rw------- 1 ceph ceph  6 Feb  7 12:12 ready
> > -rw------- 1 ceph ceph  3 Feb  7 12:12 require_osd_release
> > -rw------- 1 ceph ceph 10 Feb  7 12:12 type
> > -rw------- 1 ceph ceph  3 Feb  7 12:12 whoami
> > [root@rook-ceph-osd-36-6876cdb479-4764r ceph-36]#
> >
> > --------------------------------------------------------------------------------
> >
> > We used to create OSDs manually on Alpine Linux some years ago using
> > [0]; you can check it out as an inspiration for what should be in which
> > file.
> >
> > BR,
> >
> > Nico
> >
> >
> > [0] https://code.ungleich.ch/ungleich-public/ungleich-tools/src/branch/master/ceph/ceph-osd-create-start-alpine
> >
> > Peter van Heusden <pvh@xxxxxxxxxxx> writes:
> >
> >> Dear Ceph Community
> >>
> >> We have 5 OSD servers running Ceph v15.2.17. The host operating system
> >> is Ubuntu 20.04.
> >>
> >> One of the servers has suffered corruption to its boot operating system.
> >> Using a system rescue disk it is possible to mount the root filesystem
> >> but it is not possible to boot the operating system at the moment.
> >>
> >> The OSDs are configured with (spinning disk) data drives, WALs and DBs
> >> on partitions of SSDs, but from my examination of the filesystem the
> >> configuration in /var/lib/ceph appears to be corrupted.
> >>
> >> So my question is: what is the best option for repair going forward? Is
> >> it possible to do a clean install of the operating system and scan the
> >> existing drives in order to reconstruct the OSD configuration?
> >>
> >> Thank you,
> >> Peter
> >> P.S. the cause of the original corruption is likely due to an unplanned
> >> power outage, an event that hopefully will not recur.
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


