Re: osd out can't bring it back online

Yes, I deployed via cephadm on CentOS 7, so it is using podman. The container doesn't even start up, so I don't get a container ID. But I checked journalctl -xe, and it seems that it's trying to use a container name that still exists.
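
(Side note: since the unit name shows up in the messages, the same output can be narrowed down to just this OSD with something like

journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service

which cuts out the unrelated postfix/sshd noise further down.)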

-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has begun starting up.
Dec 01 11:39:29 gedaopl02 podman[9976]: Error: no container with name or ID ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0 found: no such container
Dec 01 11:39:29 gedaopl02 systemd[1]: Started Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af.
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished starting up.
--
-- The start-up result is done.
Dec 01 11:39:29 gedaopl02 bash[9993]: WARNING: The same type, major and minor should not be used for multiple devices.
Dec 01 11:39:29 gedaopl02 bash[9993]: Error: error creating container storage: the container name "ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0-activate" is already in use by "e43f8533d6418267d7e6f3a408a566b4221df4fb51b13d71c634ee697914bad6". You have to remove that container to be able to reuse that name.: that name is already in use
Dec 01 11:39:29 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service: main process exited, code=exited, status=125/n/a
Dec 01 11:39:29 gedaopl02 bash[10033]: WARNING: The same type, major and minor should not be used for multiple devices.
Dec 01 11:39:29 gedaopl02 bash[10033]: Error: error creating container storage: the container name "ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0-deactivate" is already in use by "ef696c5a92ea891cbd7651cdab66abe6c4ba49b70ef06e44b51c9be1cdfc36d9". You have to remove that container to be able to reuse that name.: that name is already in use
Dec 01 11:39:29 gedaopl02 systemd[1]: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service entered failed state.
Dec 01 11:39:29 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service failed.
Dec 01 11:39:39 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service holdoff time over, scheduling restart.
Dec 01 11:39:39 gedaopl02 systemd[1]: Stopped Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af.
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished shutting down.
Dec 01 11:39:39 gedaopl02 systemd[1]: Starting Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af...
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has begun starting up.
Dec 01 11:39:39 gedaopl02 podman[10134]: Error: no container with name or ID ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0 found: no such container
Dec 01 11:39:39 gedaopl02 systemd[1]: Started Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af.
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished starting up.
--
-- The start-up result is done.
Dec 01 11:39:40 gedaopl02 bash[10150]: WARNING: The same type, major and minor should not be used for multiple devices.
Dec 01 11:39:40 gedaopl02 bash[10150]: Error: error creating container storage: the container name "ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0-activate" is already in use by "e43f8533d6418267d7e6f3a408a566b4221df4fb51b13d71c634ee697914bad6". You have to remove that container to be able to reuse that name.: that name is already in use
Dec 01 11:39:40 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service: main process exited, code=exited, status=125/n/a
Dec 01 11:39:40 gedaopl02 bash[10175]: WARNING: The same type, major and minor should not be used for multiple devices.
Dec 01 11:39:40 gedaopl02 bash[10175]: Error: error creating container storage: the container name "ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0-deactivate" is already in use by "ef696c5a92ea891cbd7651cdab66abe6c4ba49b70ef06e44b51c9be1cdfc36d9". You have to remove that container to be able to reuse that name.: that name is already in use
Dec 01 11:39:40 gedaopl02 systemd[1]: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service entered failed state.
Dec 01 11:39:40 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service failed.
Dec 01 11:39:50 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service holdoff time over, scheduling restart.
Dec 01 11:39:50 gedaopl02 systemd[1]: Stopped Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af.
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has finished shutting down.
Dec 01 11:39:50 gedaopl02 systemd[1]: start request repeated too quickly for ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
Dec 01 11:39:50 gedaopl02 systemd[1]: Failed to start Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af.
-- Subject: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service has failed.
--
-- The result is failed.
Dec 01 11:39:50 gedaopl02 systemd[1]: Unit ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service entered failed state.
Dec 01 11:39:50 gedaopl02 systemd[1]: ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service failed.
Dec 01 11:39:59 gedaopl02 postfix/smtpd[10257]: connect from localhost[127.0.0.1]
Dec 01 11:39:59 gedaopl02 postfix/smtpd[10257]: disconnect from localhost[127.0.0.1]
Dec 01 11:40:00 gedaopl02 sshd[10264]: rexec line 32: Deprecated option ServerKeyBits
Dec 01 11:40:00 gedaopl02 sshd[10264]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Dec 01 11:40:00 gedaopl02 sshd[10264]: Connection closed by 127.0.0.1 port 52624 [preauth]

podman ps -a didn't show that container, so I googled and stumbled upon this post:

https://github.com/containers/podman/issues/2553

I was able to fix it by running:

podman rm --storage e43f8533d6418267d7e6f3a408a566b4221df4fb51b13d71c634ee697914bad6
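
(In case someone else runs into this: the leftover storage-only containers that podman ps -a doesn't show can, depending on the podman version, be listed with something like

podman ps --all --storage

on newer releases the flag is called --external instead, so check podman ps --help for the exact spelling on your version.)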

After that, I reset the failed state of the service and started it again:

systemctl reset-failed ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
systemctl start ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
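
As a quick sanity check that it actually came back, something like this should show the unit active and the OSD container running again:

systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
podman ps | grep osd.0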

Now Ceph is doing its magic :)

[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     d0920c36-2368-11eb-a5de-005056b703af
    health: HEALTH_WARN
            Degraded data redundancy: 1941/39432 objects degraded (4.922%), 19 pgs degraded, 19 pgs undersized
            8 pgs not deep-scrubbed in time

  services:
    mon: 1 daemons, quorum gedasvl02 (age 2w)
    mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
    mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
    osd: 3 osds: 3 up (since 9m), 3 in (since 9m); 18 remapped pgs

  task status:
    scrub status:
        mds.cephfs.gedaopl01.zjuhem: idle

  data:
    pools:   7 pools, 225 pgs
    objects: 13.14k objects, 77 GiB
    usage:   214 GiB used, 457 GiB / 671 GiB avail
    pgs:     1941/39432 objects degraded (4.922%)
             206 active+clean
             18  active+undersized+degraded+remapped+backfill_wait
             1   active+undersized+degraded+remapped+backfilling

  io:
    recovery: 105 MiB/s, 25 objects/s
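
The degraded/undersized PGs should clear on their own once backfill finishes; progress can be followed with something like

ceph -s
ceph -w

(ceph -w just streams the cluster log, so it's the more convenient way to watch the backfill complete.)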

Many thanks for your help. This was an excellent "Recovery training" :)

On 01.12.2020 at 11:50, Stefan Kooman wrote:
> On 2020-12-01 10:21, Oliver Weinmann wrote:
>> Hi Stefan,
>>
>> unfortunately it doesn't start.
>>
>> The failed osd (osd.0) is located on gedaopl02.
>> I can start the service but then after a minute or so it fails. Maybe
>> I'm looking at the wrong log file, but it's empty:
> Maybe it hits a timeout.
>> [root@gedaopl02 ~]# tail -f
>> /var/log/ceph/d0920c36-2368-11eb-a5de-005056b703af/ceph-osd.0.log
>>
>> Yesterday when I deleted the failed osd and recreated it, there were
>> lots of messages in the log file:
>>
>> https://pastebin.com/5hH27pdR
> Mostly housekeeping logs. Are your containers running in docker? A
> docker logs $container-id should give you the right logs in that case.
>
> Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



