Hi Stefan,
unfortunately it doesn't start.
The failed OSD (osd.0) is located on gedaopl02.
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.43658 root default
-7       0.21829     host gedaopl01
 2   ssd 0.21829         osd.2           up  1.00000 1.00000
-3             0     host gedaopl02
-5       0.21829     host gedaopl03
 3   ssd 0.21829         osd.3           up  1.00000 1.00000
 0             0 osd.0                 down        0 1.00000
[root@gedaopl02 ~]# systemctl --failed
  UNIT                                                                    LOAD   ACTIVE SUB    DESCRIPTION
● ceph-d0920c36-2368-11eb-a5de-005056b703af@mgr.gedaopl02.pijxbm.service loaded failed failed Ceph mgr.gedaopl02.pijxbm for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service                loaded failed failed Ceph osd.0 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.1.service                loaded failed failed Ceph osd.1 for d0920c36-2368-11eb-a5de-005056b703af
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
3 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
I can start the service, but after a minute or so it fails again. Maybe
I'm looking at the wrong log file, as it's empty:
[root@gedaopl02 ~]# tail -f /var/log/ceph/d0920c36-2368-11eb-a5de-005056b703af/ceph-osd.0.log
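If I understand cephadm correctly, the containerized daemons log to journald by default unless file logging is enabled, so I will also check the journal for osd.0 (unit name taken from the systemctl output above, fsid from the cluster):

journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service -e

or, letting cephadm pick the right unit:

cephadm logs --fsid d0920c36-2368-11eb-a5de-005056b703af --name osd.0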
Yesterday, when I deleted the failed OSD and recreated it, there were lots
of messages in the log file:
https://pastebin.com/5hH27pdR
Cheers,
Oliver
On 01.12.2020 at 09:22, Stefan Kooman wrote:
On 2020-11-30 15:55, Oliver Weinmann wrote:
I have another error, "pgs undersized"; maybe this is also causing trouble?
This is a result of the loss of one OSD and the PGs located on it. As
you only have 2 OSDs left, the cluster cannot recover onto a third OSD
(assuming defaults here). The cluster will heal itself as soon as the
third OSD is back online.
Can you start the OSD? If not, can you provide logs of the failing OSD?
Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx