Re: Cluster not recovering after OSD daemon is down

Thanks, Tupper, for replying.

Shouldn't the PGs be remapped to other OSDs?

Yes, removing the OSD from the cluster does result in a full recovery.
But that should not be needed, right?
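
For reference, I removed it with the usual sequence (a sketch; N stands in
for the dead OSD's id):

  ceph osd out N
  ceph osd crush remove osd.N
  ceph auth del osd.N
  ceph osd rm N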



On Tue, May 3, 2016 at 6:31 PM, Tupper Cole <tcole@xxxxxxxxxx> wrote:
> The degraded PGs are still mapped to the down OSD and have not been
> remapped to a new one. Removing the OSD would likely result in a full
> recovery.
>
> As a note, having two monitors (or any even number of monitors) is not
> recommended. If either monitor goes down, you will lose quorum. The
> recommended number of monitors for any cluster is at least three.
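>
> For example (a sketch, assuming ceph-deploy is in use and a spare host,
> here called "mon3", is available):
>
>     ceph-deploy mon add mon3
>
> Alternatively, "ceph mon add <name> <ip[:port]>" followed by starting a
> ceph-mon daemon on the new host does the same thing manually.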
>
> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> Hi Cephers,
>>
>> I am running a very small cluster of 3 storage and 2 monitor nodes.
>>
>> After I kill one OSD daemon, the cluster never fully recovers: 9 PGs
>> remain undersized for no apparent reason.
>>
>> After I restart that OSD daemon, the cluster recovers in no time.
>>
>> The size of every pool is 3 and min_size is 2.
>>
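>> (Size and min_size can be confirmed per pool with, e.g.:
>>
>>   ceph osd dump | grep '^pool'
>> )
>>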
>> Can anybody please help?
>>
>> Output of "ceph -s":
>>     cluster fac04d85-db48-4564-b821-deebda046261
>>      health HEALTH_WARN
>>             9 pgs degraded
>>             9 pgs stuck degraded
>>             9 pgs stuck unclean
>>             9 pgs stuck undersized
>>             9 pgs undersized
>>             recovery 3327/195138 objects degraded (1.705%)
>>             pool .users pg_num 512 > pgp_num 8
>>      monmap e2: 2 mons at
>> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
>>             election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>>      osdmap e857: 69 osds: 68 up, 68 in
>>       pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
>>             279 GB used, 247 TB / 247 TB avail
>>             3327/195138 objects degraded (1.705%)
>>                  887 active+clean
>>                    9 active+undersized+degraded
>>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>>
>> Output of "ceph health detail":
>>
>> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
>> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
>> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
>> pg 7.a is stuck unclean for 322742.938959, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck unclean for 322754.823455, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck unclean for 322750.685684, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck unclean for 322732.665345, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck unclean for 331869.103538, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck unclean for 331871.208948, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck unclean for 331822.771240, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck unclean for 323021.274535, current state
>> active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck unclean for 323007.574395, current state
>> active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck undersized for 322487.284302, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck undersized for 322487.287164, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck undersized for 322487.285566, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck undersized for 322487.287168, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck undersized for 331351.476170, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck undersized for 331351.475707, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck undersized for 322487.280309, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck undersized for 322487.286347, current state
>> active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck undersized for 322487.280027, current state
>> active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck degraded for 322487.284340, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck degraded for 322487.287202, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck degraded for 322487.285604, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck degraded for 322487.287207, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck degraded for 331351.476209, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck degraded for 331351.475746, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck degraded for 322487.280348, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck degraded for 322487.286386, current state
>> active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck degraded for 322487.280066, current state
>> active+undersized+degraded, last acting [43,1]
>> pg 5.72 is active+undersized+degraded, acting [16,49]
>> pg 5.4e is active+undersized+degraded, acting [16,38]
>> pg 5.32 is active+undersized+degraded, acting [39,19]
>> pg 5.37 is active+undersized+degraded, acting [43,1]
>> pg 5.2c is active+undersized+degraded, acting [47,18]
>> pg 5.27 is active+undersized+degraded, acting [26,19]
>> pg 6.13 is active+undersized+degraded, acting [30,16]
>> pg 4.17 is active+undersized+degraded, acting [47,20]
>> pg 7.a is active+undersized+degraded, acting [38,2]
>> recovery 3327/195138 objects degraded (1.705%)
>> pool .users pg_num 512 > pgp_num 8
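>>
>> To dig deeper I can query one of the stuck PGs directly, e.g.:
>>
>>   ceph pg 7.a query
>>
>> And I presume the pg_num > pgp_num warning on .users can be cleared (at
>> the cost of some data movement) with:
>>
>>   ceph osd pool set .users pgp_num 512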
>>
>>
>> My CRUSH map is the default.
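>>
>> (For completeness, it can be dumped and decompiled with:
>>
>>   ceph osd getcrushmap -o crush.bin
>>   crushtool -d crush.bin -o crush.txt
>> )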
>>
>> My ceph.conf is:
>>
>> [osd]
>> osd mkfs type=xfs
>> osd recovery threads=2
>> osd disk thread ioprio class=idle
>> osd disk thread ioprio priority=7
>> osd journal=/var/lib/ceph/osd/ceph-$id/journal
>> filestore flusher=False
>> osd op num shards=3
>> debug osd=5
>> osd disk threads=2
>> osd data=/var/lib/ceph/osd/ceph-$id
>> osd op num threads per shard=5
>> osd op threads=4
>> keyring=/var/lib/ceph/osd/ceph-$id/keyring
>> osd journal size=4096
>>
>>
>> [global]
>> filestore max sync interval=10
>> auth cluster required=cephx
>> osd pool default min size=3
>> osd pool default size=3
>> public network=10.140.13.0/26
>> objecter inflight op_bytes=1073741824
>> auth service required=cephx
>> filestore min sync interval=1
>> fsid=fac04d85-db48-4564-b821-deebda046261
>> keyring=/etc/ceph/keyring
>> cluster network=10.140.13.0/26
>> auth client required=cephx
>> filestore xattr use omap=True
>> max open files=65536
>> objecter inflight ops=2048
>> osd pool default pg num=512
>> log to syslog = true
>> #err to syslog = true
>>
>>
>> --
>> Gaurav Bafna
>> 9540631400
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> Thanks,
> Tupper Cole
> Senior Storage Consultant
> Global Storage Consulting, Red Hat
> tcole@xxxxxxxxxx
> phone:  + 01 919-720-2612



-- 
Gaurav Bafna
9540631400
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


