Hi Oliver,

I don't know how you managed to remove all MGRs from the cluster, but
here is the documentation on how to manually recover from this:

https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
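From memory, the steps on that page boil down to something like the
sketch below. Please double-check every command against the page before
running anything; "mgr.gedaopl03.foo" is only a placeholder daemon name,
the fsid and image are taken from your output below, and all of it
assumes the mons still answer to the ceph CLI:

# ceph auth get-or-create mgr.gedaopl03.foo mon 'allow profile mgr' osd 'allow *' mds 'allow *'
# ceph config generate-minimal-conf

Put the minimal config and the new keyring into a small JSON file, say
config-json.json, with the two keys "config" and "keyring", then run the
deploy on gedaopl03 itself:

# cephadm --image docker.io/ceph/ceph:v15 deploy \
      --fsid d0920c36-2368-11eb-a5de-005056b703af \
      --name mgr.gedaopl03.foo --config-json config-json.json

As far as I know cephadm deploy only acts on the host it is run on, and
without a config and keyring the new mgr has nothing to authenticate
with, which would explain why your deploy attempt on gedasvl02 did not
leave a mgr container behind on gedaopl03.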
Hope that helps,

Sebastian

On 15.03.21 at 18:24, Oliver Weinmann wrote:
> Hi Sebastian,
>
> thanks, that seems to have worked, at least on one of the two nodes. But
> now I have another problem. It seems that all mgr daemons are gone and
> the ceph command is stuck.
>
> [root@gedasvl02 ~]# cephadm ls | grep mgr
>
> I tried to deploy a new mgr, but this doesn't seem to work either:
>
> [root@gedasvl02 ~]# cephadm ls | grep mgr
> [root@gedasvl02 ~]# cephadm deploy --fsid d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03
> INFO:cephadm:Deploy daemon mgr.gedaopl03 ...
>
> At least I can't see a mgr container on node gedaopl03:
>
> [root@gedaopl03 ~]# podman ps
> CONTAINER ID  IMAGE                                 COMMAND               CREATED     STATUS         PORTS  NAMES
> 63518d95201b  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl03
> aa9b57fd77b8  docker.io/ceph/ceph:v15               -n client.crash.g...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl03
> 8b02715f9cb4  docker.io/ceph/ceph:v15               -n osd.2 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.2
> 40f15a6357fe  docker.io/ceph/ceph:v15               -n osd.7 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.7
> bda260378239  docker.io/ceph/ceph:v15               -n mds.cephfs.ged...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-mds.cephfs.gedaopl03.kybzgy
>
> [root@gedaopl03 ~]# systemctl --failed
>   UNIT                                                                        LOAD    ACTIVE  SUB     DESCRIPTION
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@crash.gedaopl03.service          loaded  failed  failed  Ceph crash.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service            loaded  failed  failed  Ceph mon.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl03.service  loaded  failed  failed  Ceph node-exporter.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.3.service                    loaded  failed  failed  Ceph osd.3 for d0920c36-2368-11eb-a5de-005056b703af
>
> LOAD   = Reflects whether the unit definition was properly loaded.
> ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
> SUB    = The low-level unit activation state, values depend on unit type.
>
> 4 loaded units listed. Pass --all to see loaded but inactive units, too.
> To show all installed unit files use 'systemctl list-unit-files'.
>
> Maybe it's best to just scrap the whole cluster. It is only for testing,
> but I guess it is also good practice for recovery. :)
>
> On 12 March 2021 at 12:35, Sebastian Wagner <swagner@xxxxxxxx> wrote:
>
>> Hi Oliver,
>>
>> # ssh gedaopl02
>> # cephadm rm-daemon osd.0
>>
>> should do the trick.
>>
>> Be careful to remove the broken OSD :-)
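>>
>> If plain "cephadm rm-daemon osd.0" complains about missing arguments,
>> the long form is roughly the following (from memory, so please check
>> "cephadm rm-daemon --help" for the exact flags):
>>
>> # cephadm rm-daemon --name osd.0 --fsid d0920c36-2368-11eb-a5de-005056b703af
>>
>> The same command should also let you clean up the other stale "stopped"
>> entries that ceph orch ps still lists, as long as you run it on the
>> host the stale entry points to.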
>>
>> Best,
>>
>> Sebastian
>>
>> On 11.03.21 at 22:10, Oliver Weinmann wrote:
>>> Hi,
>>>
>>> On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite
>>> a while, I noticed that it shows some errors:
>>>
>>> [root@gedasvl02 ~]# ceph health detail
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> HEALTH_WARN 2 failed cephadm daemon(s)
>>> [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>>>     daemon osd.0 on gedaopl02 is in error state
>>>     daemon node-exporter.gedaopl01 on gedaopl01 is in error state
>>>
>>> The error about osd.0 is strange, since osd.0 is actually up and
>>> running, but on a different node. I guess I failed to remove it correctly
>>> from node gedaopl02 and then added a new OSD on a different node,
>>> gedaopl01, and now there are duplicate OSD ids for osd.0 and osd.2.
>>>
>>> [root@gedasvl02 ~]# ceph orch ps
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> NAME                         HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
>>> alertmanager.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
>>> crash.gedaopl01              gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
>>> crash.gedaopl02              gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
>>> crash.gedaopl03              gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
>>> crash.gedasvl02              gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
>>> grafana.gedasvl02            gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
>>> mds.cephfs.gedaopl01.zjuhem  gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
>>> mds.cephfs.gedasvl02.xsjtpi  gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
>>> mgr.gedaopl03.zilwbl         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
>>> mon.gedaopl03                gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
>>> mon.gedasvl02                gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
>>> node-exporter.gedaopl01      gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
>>> node-exporter.gedaopl02      gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
>>> node-exporter.gedaopl03      gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
>>> node-exporter.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
>>> *osd.0                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988*
>>> *osd.0                       gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
>>> osd.1                        gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
>>> *osd.2                       gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
>>> *osd.2                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb*
>>> osd.3                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
>>> osd.4                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
>>> osd.5                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
>>> osd.6                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
>>> osd.7                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
>>> osd.8                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
>>> prometheus.gedasvl02         gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4
>>>
>>> Is there a way to clean / purge the stopped and error ones?
>>>
>>> I don't know what is wrong with the node-exporter, because podman ps -a
>>> on gedaopl01 looks OK. Maybe that is also a zombie daemon?
>>>
>>> [root@gedaopl01 ~]# podman ps -a
>>> CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
>>> e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
>>> 41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
>>> 810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
>>> cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
>>> ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
>>>
>>> I replaced the very old disks with brand new SAMSUNG PM883 drives and
>>> would like to upgrade to 15.2.9, but the upgrade guide recommends doing
>>> this only on a healthy cluster. :)
>>>
>>> Cheers,
>>>
>>> Oliver
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> --
>> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
>> (HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer

--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx