Re: OSD down, how to reconstruct it from its main and block.db parts ?

Wladimir Mutel <mwg@xxxxxxxxx> · Fri, 30 Oct 2020 20:32:11 +0200

Dear Mr. Caro,

here are the next steps I decided to take at my own discretion :

1. running "ceph-volume lvm list" , I discovered the following section about my non-activating osd.1 :
====== osd.1 =======
  [block]       /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202
      block device              /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202
      block uuid                Pe7IJO-TXMq-wcAt-0ytX-WuPA-6XoH-fo5wUb
      cephx lockbox secret
      cluster fsid              49cdfe90-6f6e-4afe-8558-bf14a13aadfa
      cluster name              ceph
      crush device class        None
      db device                 /dev/nvme0n1p4
      db uuid                   1d07e91a-0bef-434d-a23c-0df76025a726
      encrypted                 0
      osd fsid                  8c6324a3-0364-4fad-9dcb-81a1661ee202
      osd id                    1
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdd1
  [db]          /dev/nvme0n1p4
      PARTUUID                  1d07e91a-0bef-434d-a23c-0df76025a726
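
The two values that "ceph-volume lvm activate" expects (the OSD id and the OSD fsid) can be read off the listing above. A minimal sketch, assuming the plain-text layout shown here, where the "osd fsid" line precedes the "osd id" line within each section; the function name osd_activate_args is only illustrative and not part of ceph-volume:

```shell
# Read "ceph-volume lvm list" text on stdin and print "<osd id> <osd fsid>"
# for every section that carries both tags (duplicates are collapsed).
osd_activate_args() {
    awk '
        $1 == "osd" && $2 == "fsid" { fsid = $3 }
        $1 == "osd" && $2 == "id" && fsid != "" { print $3, fsid; fsid = "" }
    ' | sort -u
}

# Example (on a host with ceph-volume installed):
#   ceph-volume lvm list | osd_activate_args
```

The printed pair can then be passed to "ceph-volume lvm activate --bluestore" in that order, as in step 2.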

2. running "ceph-volume lvm activate --bluestore 1 8c6324a3-0364-4fad-9dcb-81a1661ee202" , I got the following printout :
Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202 --path /var/lib/ceph/osd/ceph-1 --no-mon-config
Running command: /bin/ln -snf /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202 /var/lib/ceph/osd/ceph-1/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Running command: /bin/chown -R ceph:ceph /dev/dm-1
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /bin/ln -snf /dev/nvme0n1p4 /var/lib/ceph/osd/ceph-1/block.db
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p4
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block.db
Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p4
Running command: /bin/systemctl enable ceph-volume@lvm-1-8c6324a3-0364-4fad-9dcb-81a1661ee202
 stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-8c6324a3-0364-4fad-9dcb-81a1661ee202.service → /lib/systemd/system/ceph-volume@.service.
Running command: /bin/systemctl enable --runtime ceph-osd@1
 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service → /lib/systemd/system/ceph-osd@.service.
Running command: /bin/systemctl start ceph-osd@1
--> ceph-volume lvm activate successful for osd ID: 1
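
For readers who want the essence of that printout, here is a condensed dry run of the same sequence. It is only a sketch built from the commands in the output above: it prints the steps and executes nothing, and the function name activate_steps is made up for illustration.

```shell
# Dry run: print (do NOT execute) the main steps seen in the activate output.
# $1 = OSD id, $2 = main block device, $3 = block.db device
activate_steps() {
    id=$1 block=$2 db=$3
    dir=/var/lib/ceph/osd/ceph-$id
    cat <<EOF
mount -t tmpfs tmpfs $dir
ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev $block --path $dir --no-mon-config
ln -snf $block $dir/block
ln -snf $db $dir/block.db
chown -R ceph:ceph $dir
systemctl start ceph-osd@$id
EOF
}

# Example:
#   activate_steps 1 /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202 /dev/nvme0n1p4
```

The notable difference from a bare "ceph-bluestore-tool prime-osd-dir" run is the tmpfs mount first and the ownership fix afterwards.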

3. after that I happily noticed these records in my "journalctl -b" :
Oct 30 20:00:13 p10s systemd[1]: Starting Ceph object storage daemon osd.1...
Oct 30 20:00:13 p10s systemd[1]: Started Ceph object storage daemon osd.1.

4. and then, very soon, I saw a storm of journal messages about PG parts found on osd.1 , ending with words like "transitioning to Stray" or "transitioning to Primary". After that I was happy to see this :
Oct 30 20:10:14 p10s ceph-mon[729]: mon.p10s mon.0 196 : osd.1 [v2:192.168.200.230:6858/4036,v1:192.168.200.230:6859/4036] boot
Oct 30 20:10:14 p10s ceph-mgr[728]: 2020-10-30T20:10:14.722+0200 7fd915577700  0 [progress WARNING root] osd.1 marked in
Oct 30 20:10:14 p10s ceph-osd[4036]: 2020-10-30T20:10:14.730+0200 7fa3ce974700  1 osd.1 464042 state: booting -> active
Oct 30 20:10:14 p10s ceph-mgr[728]: 2020-10-30T20:10:14.738+0200 7fd915577700  0 [progress WARNING root] 0 PGs affected by osd.1 being marked in

5. And now it is even shown in my "ceph osd df" as being up and containing data !

So, in conclusion, my guess is that this problem was caused by a missing startup symlink in /etc/systemd/system/multi-user.target.wants/
But why it was not created by the initial run of "ceph-volume lvm prepare" is still a mystery to me ...
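
One way to check for that symlink on other OSDs is sketched below. The unit name pattern is taken from the activate output in step 2; the function name and the optional third argument are assumptions for illustration. Note that in that output it is the activate step that runs "systemctl enable". As far as I know, "ceph-volume lvm prepare" on its own does not enable any units ("ceph-volume lvm create" runs prepare followed by activate), which may be the explanation.

```shell
# Succeed if the ceph-volume activation unit for an OSD is enabled,
# i.e. its symlink exists under multi-user.target.wants.
# $1 = OSD id, $2 = OSD fsid, $3 = systemd dir (defaults to /etc/systemd/system)
has_activation_unit() {
    dir=${3:-/etc/systemd/system}
    [ -L "$dir/multi-user.target.wants/ceph-volume@lvm-$1-$2.service" ]
}

# Example:
#   has_activation_unit 1 8c6324a3-0364-4fad-9dcb-81a1661ee202 || echo "osd.1 will not be activated at boot"
```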

David Caro wrote:

Hi Wladimir, according to the logs you first sent it seems that there is an
authentication issue (the osd daemon not being able to fetch the mon config):

Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 AuthRegistry(0x7fff46ea5d80) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx
Oct 23 16:59:36 p10s ceph-osd[3987]: failed to fetch mon config (--no-mon-config to skip)
Oct 23 16:59:36 p10s systemd[1]: ceph-osd@1.service: Main process exited, code=exited, status=1/FAILURE

The keyring file it fails to load is where the auth details for the
osd daemon should be.
Some more info here:
   https://docs.ceph.com/en/latest/man/8/ceph-authtool/
   https://docs.ceph.com/en/latest/rados/configuration/auth-config-ref/
   https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/
   (specifically step 5)

I'm not sure if you were able to fix it or not, but I'd start trying to get
that fixed before playing with ceph-volume.

On 10/27 10:24, Wladimir Mutel wrote:
Dear David,

I assimilated most of my Ceph configuration into the cluster itself when this feature was introduced in Mimic.
I see some fsid in [global] section of /etc/ceph/ceph.conf , and some key in [client.admin] section of /etc/ceph/ceph.client.admin.keyring
The rest is pretty uninteresting, some minimal adjustments in config file and cluster's config dump.

Looking into the Python scripts of ceph-volume, I noticed that tmpfs is mounted during the run of "ceph-volume lvm activate",
and "ceph-bluestore-tool prime-osd-dir" is started from the same script afterwards.
Should I try starting "ceph-volume lvm activate" in some manual way to see where it stumbles and why ?

David Caro wrote:
Hi Wladimir,

If the "unable to find keyring" message disappeared, what was the error after that fix?

If it's still failing to fetch the mon config, check your authentication (you might have to add the osd key to the keyring again), and/or that the mon IPs are correct in your osd's ceph.conf file.

On 23 October 2020 16:08:02 CEST, Wladimir Mutel <mwg@xxxxxxxxx> wrote:
Dear all,

after breaking my experimental 1-host Ceph cluster and making one of its
PGs 'incomplete', I left it abandoned for some time.
Now I have decided to bring it back to life and found that it cannot
start one of its OSDs (osd.1, to name it).

"ceph osd df" shows :

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0    hdd        0   1.00000  2.7 TiB  1.6 TiB  1.6 TiB  113 MiB  4.7 GiB  1.1 TiB  59.77  0.69  102      up
 1    hdd  2.84549         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 2    hdd  2.84549   1.00000  2.8 TiB  2.6 TiB  2.5 TiB   57 MiB  3.8 GiB  275 GiB  90.58  1.05  176      up
 3    hdd  2.84549   1.00000  2.8 TiB  2.6 TiB  2.5 TiB   57 MiB  3.9 GiB  271 GiB  90.69  1.05  185      up
 4    hdd  2.84549   1.00000  2.8 TiB  2.6 TiB  2.5 TiB   63 MiB  4.2 GiB  263 GiB  90.98  1.05  184      up
 5    hdd  2.84549   1.00000  2.8 TiB  2.6 TiB  2.5 TiB   52 MiB  3.8 GiB  263 GiB  90.96  1.05  178      up
 6    hdd  2.53400   1.00000  2.5 TiB  2.3 TiB  2.3 TiB  173 MiB  5.2 GiB  228 GiB  91.21  1.05  178      up
 7    hdd  2.53400   1.00000  2.5 TiB  2.3 TiB  2.3 TiB  147 MiB  5.2 GiB  230 GiB  91.12  1.05  168      up
                       TOTAL   19 TiB   17 TiB   16 TiB  662 MiB   31 GiB  2.6 TiB  86.48
MIN/MAX VAR: 0.69/1.05  STDDEV: 10.90
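
Picking the down OSDs out of such a listing can be scripted. A minimal sketch, assuming the output is read as the command prints it (one OSD per line, not re-wrapped by a mail client) and that STATUS is the last column as shown here; the function name down_osds is illustrative:

```shell
# Print the IDs of OSDs whose STATUS column (the last field) is "down".
# Reads "ceph osd df" text on stdin; header and TOTAL lines are skipped
# because their first field is not a number.
down_osds() {
    awk '$1 ~ /^[0-9]+$/ && $NF == "down" { print $1 }'
}

# Example:
#   ceph osd df | down_osds
```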

"ceph device ls" shows :

DEVICE                                      HOST:DEV      DAEMONS                        LIFE EXPECTANCY
GIGABYTE_GP-ASACNE2100TTTDR_SN191108950380  p10s:nvme0n1  osd.1 osd.2 osd.3 osd.4 osd.5
WDC_WD30EFRX-68N32N0_WD-WCC7K1JJXVST        p10s:sdd      osd.1
WDC_WD30EFRX-68N32N0_WD-WCC7K1VUYPRA        p10s:sda      osd.6
WDC_WD30EFRX-68N32N0_WD-WCC7K2CKX8NT        p10s:sdb      osd.7
WDC_WD30EFRX-68N32N0_WD-WCC7K2UD8H74        p10s:sde      osd.2
WDC_WD30EFRX-68N32N0_WD-WCC7K2VFTR1F        p10s:sdh      osd.5
WDC_WD30EFRX-68N32N0_WD-WCC7K3CYKL87        p10s:sdf      osd.3
WDC_WD30EFRX-68N32N0_WD-WCC7K6FPZAJP        p10s:sdc      osd.0
WDC_WD30EFRX-68N32N0_WD-WCC7K7FXSCRN        p10s:sdg      osd.4

In my last migration, I created a bluestore volume with external
block.db like this :

"ceph-volume lvm prepare --bluestore --data /dev/sdd1 --block.db /dev/nvme0n1p4"

And I can see this metadata with

"ceph-bluestore-tool show-label --dev /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202" :

{
      "/dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202": {
          "osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202",
          "size": 3000588304384,
          "btime": "2020-07-12T11:34:16.579735+0300",
          "description": "main",
          "bfm_blocks": "45785344",
          "bfm_blocks_per_key": "128",
          "bfm_bytes_per_block": "65536",
          "bfm_size": "3000588304384",
          "bluefs": "1",
          "ceph_fsid": "49cdfe90-6f6e-4afe-8558-bf14a13aadfa",
          "kv_backend": "rocksdb",
          "magic": "ceph osd volume v026",
          "mkfs_done": "yes",
          "osd_key": "AQD9ygpf+7+MABAAqtj4y1YYgxwCaAN/jgDSwg==",
          "ready": "ready",
          "require_osd_release": "14",
          "whoami": "1"
      }
}

and by

"ceph-bluestore-tool show-label --dev /dev/nvme0n1p4" :

{
      "/dev/nvme0n1p4": {
          "osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202",
          "size": 128025886720,
          "btime": "2020-07-12T11:34:16.592054+0300",
          "description": "bluefs db"
      }
}
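
The two labels can also be compared mechanically rather than by eye. A rough sketch, assuming each "show-label" output has been saved to a file and that the field appears on its own line as a quoted string, as in the dumps above; label_field and same_osd_uuid are made-up helper names, and a real JSON parser would be more robust than sed:

```shell
# Extract one string-valued field from "ceph-bluestore-tool show-label"
# JSON on stdin. $1 = field name, e.g. osd_uuid
label_field() {
    sed -n 's/.*"'"$1"'": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if both saved label dumps carry the same, non-empty osd_uuid.
# $1, $2 = files holding show-label output
same_osd_uuid() {
    a=$(label_field osd_uuid < "$1")
    b=$(label_field osd_uuid < "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example:
#   ceph-bluestore-tool show-label --dev /dev/nvme0n1p4 > db.json
#   same_osd_uuid main.json db.json && echo "both parts belong to the same OSD"
```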

As you can see, their osd_uuid values are equal.
But when I try to start it by hand with "systemctl restart ceph-osd@1" ,
I get this in the logs ("journalctl -b -u ceph-osd@1") :

-- Logs begin at Tue 2020-10-13 19:09:49 EEST, end at Fri 2020-10-23 16:59:38 EEST. --
Oct 23 16:59:36 p10s systemd[1]: Starting Ceph object storage daemon osd.1...
Oct 23 16:59:36 p10s systemd[1]: Started Ceph object storage daemon osd.1.
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 AuthRegistry(0x7fff46ea5d80) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx
Oct 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 AuthRegistry(0x7fff46ea5d80) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx
Oct 23 16:59:36 p10s ceph-osd[3987]: failed to fetch mon config (--no-mon-config to skip)
Oct 23 16:59:36 p10s systemd[1]: ceph-osd@1.service: Main process exited, code=exited, status=1/FAILURE
Oct 23 16:59:36 p10s systemd[1]: ceph-osd@1.service: Failed with result 'exit-code'.

And so my question is: how can I make this OSD known to the Ceph cluster
again without recreating it from scratch with ceph-volume ?
I see that every folder under "/var/lib/ceph/osd/" is a tmpfs mount
point filled with the appropriate files and symlinks, except for
"/var/lib/ceph/osd/ceph-1",
which is just an empty folder not mounted anywhere.
I tried to run

"ceph-bluestore-tool prime-osd-dir --dev /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202 --path /var/lib/ceph/osd/ceph-1"

It created some files under /var/lib/ceph/osd/ceph-1, but without a tmpfs
mount, and these files belonged to root. I changed the owner of these files
to ceph:ceph and created the appropriate symlinks for block and block.db,
but ceph-osd@1 still did not start. Only the "unable to find keyring"
messages disappeared.
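
A quick sanity check of an OSD directory after such a manual attempt might look like the sketch below. It only tests for the three entries mentioned in this thread (keyring, block, block.db); a working OSD directory holds more files than that, and the function name check_osd_dir is hypothetical.

```shell
# Report which of the entries discussed in this thread are missing from an
# OSD directory. Prints "ok" and succeeds if all three are present.
# $1 = OSD directory, e.g. /var/lib/ceph/osd/ceph-1
check_osd_dir() {
    dir=$1
    missing=""
    [ -f "$dir/keyring" ]  || missing="$missing keyring"
    [ -L "$dir/block" ]    || missing="$missing block"
    [ -L "$dir/block.db" ] || missing="$missing block.db"
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Example:
#   check_osd_dir /var/lib/ceph/osd/ceph-1
```

Whether the directory is a mount point at all can be checked separately, for instance with "mountpoint -q /var/lib/ceph/osd/ceph-1" from util-linux.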

Please advise on where to go next.
Thanks in advance for your help.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx