Need to work out why those 4 aren't starting then.

First I would check they are showing up at the OS layer via dmesg, fdisk, etc. If you can see the correct number of disks on each node, then check the service status and Ceph logs for each OSD. The location of the log files depends on how you set up the cluster/OSDs. (See the example commands at the end of this message.)

> On 22 Feb 2022, at 14:45, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>
> I should have 10 OSDs, below is the output:
>
> root@ceph-mon1:~# ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
> -1         1.95297  root default
> -5         0.78119      host ceph-mon1
>  2    hdd  0.19530          osd.2         down         0  1.00000
>  4    hdd  0.19530          osd.4         down         0  1.00000
>  8    hdd  0.19530          osd.8           up   1.00000  1.00000
>  9    hdd  0.19530          osd.9           up   1.00000  1.00000
> -7         0.58589      host ceph-mon2
>  1    hdd  0.19530          osd.1           up   1.00000  1.00000
>  5    hdd  0.19530          osd.5           up   1.00000  1.00000
>  7    hdd  0.19530          osd.7           up   1.00000  1.00000
> -3         0.58589      host ceph-mon3
>  0    hdd  0.19530          osd.0         down   1.00000  1.00000
>  3    hdd  0.19530          osd.3           up   1.00000  1.00000
>  6    hdd  0.19530          osd.6         down         0  1.00000
>
> I tried to restart the ones that are down, but it failed.
>
> On Tue, Feb 22, 2022 at 4:42 PM <ashley@xxxxxxxxxxxxxx> wrote:
>
>> What does
>>
>> ‘ceph osd tree’ show?
>>
>> How many OSDs should you have, 7 or 10?
>>
>> On 22 Feb 2022, at 14:40, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>
>> Actually one of my colleagues rebooted all the nodes and did not prepare
>> them first (setting noout, norecover, ...). Once all the nodes were back
>> up the cluster was no longer accessible, and the above are the messages
>> we are getting. I did not remove any OSDs; they are only marked down.
>> Below is my ceph.conf:
>>
>> mon initial members = ceph-mon1,ceph-mon2,ceph-mon3
>> mon_allow_pool_delete = True
>> mon_clock_drift_allowed = 0.5
>> mon_max_pg_per_osd = 400
>> mon_osd_allow_primary_affinity = 1
>> mon_pg_warn_max_object_skew = 0
>> mon_pg_warn_max_per_osd = 0
>> mon_pg_warn_min_per_osd = 0
>> osd pool default crush rule = -1
>> osd_pool_default_min_size = 1
>> osd_pool_default_size = 2
>> public network = 0.0.0.0/0
>>
>> On Tue, Feb 22, 2022 at 4:32 PM <ashley@xxxxxxxxxxxxxx> wrote:
>>
>>> You have 1 OSD offline; has this disk failed, or are you aware of what
>>> has caused it to go offline?
>>> It shows you have 10 OSDs but only 7 in; have you removed the other 3?
>>> Was the data fully drained off these first?
>>>
>>> I see you have 11 pools; what are these set up as, type and min/max size?
>>>
>>>> On 22 Feb 2022, at 14:15, Michel Niyoyita <micou12@xxxxxxxxx> wrote:
>>>>
>>>> Dear Ceph Users,
>>>>
>>>> Kindly help me to repair my cluster; it has been down since yesterday
>>>> and up to now I am not able to get it up and running.
>>>> Below are some findings:
>>>>
>>>>    id:     6ad86187-2738-42d8-8eec-48b2a43c298f
>>>>    health: HEALTH_ERR
>>>>            mons are allowing insecure global_id reclaim
>>>>            1/3 mons down, quorum ceph-mon1,ceph-mon3
>>>>            10/32332 objects unfound (0.031%)
>>>>            1 osds down
>>>>            3 scrub errors
>>>>            Reduced data availability: 124 pgs inactive, 60 pgs down, 411 pgs stale
>>>>            Possible data damage: 9 pgs recovery_unfound, 1 pg backfill_unfound, 1 pg inconsistent
>>>>            Degraded data redundancy: 6009/64664 objects degraded (9.293%), 55 pgs degraded, 80 pgs undersized
>>>>            11 pgs not deep-scrubbed in time
>>>>            5 slow ops, oldest one blocked for 1638 sec, osd.9 has slow ops
>>>>
>>>>  services:
>>>>    mon: 3 daemons, quorum ceph-mon1,ceph-mon3 (age 3h), out of quorum: ceph-mon2
>>>>    mgr: ceph-mon1(active, since 9h), standbys: ceph-mon2
>>>>    osd: 10 osds: 6 up (since 7h), 7 in (since 9h); 43 remapped pgs
>>>>
>>>>  data:
>>>>    pools:   11 pools, 560 pgs
>>>>    objects: 32.33k objects, 159 GiB
>>>>    usage:   261 GiB used, 939 GiB / 1.2 TiB avail
>>>>    pgs:     11.429% pgs unknown
>>>>             10.714% pgs not active
>>>>             6009/64664 objects degraded (9.293%)
>>>>             1384/64664 objects misplaced (2.140%)
>>>>             10/32332 objects unfound (0.031%)
>>>>             245  stale+active+clean
>>>>              70  active+clean
>>>>              64  unknown
>>>>              48  stale+down
>>>>              45  stale+active+undersized+degraded
>>>>              37  stale+active+clean+remapped
>>>>              28  stale+active+undersized
>>>>              12  down
>>>>               2  stale+active+recovery_unfound+degraded
>>>>               2  stale+active+recovery_unfound+undersized+degraded
>>>>               2  stale+active+recovery_unfound+undersized+degraded+remapped
>>>>               2  active+recovery_unfound+undersized+degraded+remapped
>>>>               1  active+clean+inconsistent
>>>>               1  stale+active+recovery_unfound+degraded+remapped
>>>>               1  stale+active+backfill_unfound+undersized+degraded+remapped
>>>>
>>>> If someone has faced the same issue, please help me.
>>>>
>>>> Best Regards.
>>>>
>>>> Michel
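P.S. For reference, here is a rough sketch of the per-node checks I mean. It assumes a package-based (non-containerized) deployment where each OSD runs as a ceph-osd@<id> systemd unit and logs to the default /var/log/ceph location; I'm using osd.2 on ceph-mon1 only as an example, so substitute your own OSD IDs and hosts:

    # Confirm the OS still sees the expected block devices
    dmesg | grep -i -e sd -e nvme
    lsblk
    fdisk -l

    # Check the OSD service state and its recent log output
    systemctl status ceph-osd@2
    journalctl -u ceph-osd@2 -n 200 --no-pager
    tail -n 100 /var/log/ceph/ceph-osd.2.log

    # If the device and logs look sane, try bringing the OSD back up
    systemctl restart ceph-osd@2

If the cluster was deployed with cephadm/containers instead, the unit names and log locations will differ, so treat the above only as a starting point.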
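And for the next planned reboot: the flags mentioned earlier in the thread can be set before taking nodes down and cleared once everything is back, roughly:

    # Before the maintenance window
    ceph osd set noout
    ceph osd set norecover

    # After all nodes and OSDs are back up
    ceph osd unset noout
    ceph osd unset norecover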