Cluster not recovering after OSD daemon is down

Gaurav Bafna <bafnag@xxxxxxxxx> · Tue, 3 May 2016 18:12:18 +0530

Hi Cephers,

I am running a very small cluster of 3 storage and 2 monitor nodes.

After I kill one OSD daemon, the cluster never fully recovers: 9 PGs
remain undersized for an unknown reason.

After I restart that OSD daemon, the cluster recovers in no time.

The size of all pools is 3 and min_size is 2.

Can anybody please help?

Output of "ceph -s":
    cluster fac04d85-db48-4564-b821-deebda046261
     health HEALTH_WARN
            9 pgs degraded
            9 pgs stuck degraded
            9 pgs stuck unclean
            9 pgs stuck undersized
            9 pgs undersized
            recovery 3327/195138 objects degraded (1.705%)
            pool .users pg_num 512 > pgp_num 8
     monmap e2: 2 mons at {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
            election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
     osdmap e857: 69 osds: 68 up, 68 in
      pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
            279 GB used, 247 TB / 247 TB avail
            3327/195138 objects degraded (1.705%)
                 887 active+clean
                   9 active+undersized+degraded
  client io 395 B/s rd, 0 B/s wr, 0 op/s

Output of "ceph health detail":

HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
degraded (1.705%); pool .users pg_num 512 > pgp_num 8
pg 7.a is stuck unclean for 322742.938959, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck unclean for 322754.823455, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck unclean for 322750.685684, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck unclean for 322732.665345, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck unclean for 331869.103538, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck unclean for 331871.208948, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck unclean for 331822.771240, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck unclean for 323021.274535, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck unclean for 323007.574395, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck undersized for 322487.284302, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck undersized for 322487.287164, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck undersized for 322487.285566, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck undersized for 322487.287168, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck undersized for 331351.476170, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck undersized for 331351.475707, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck undersized for 322487.280309, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck undersized for 322487.286347, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck undersized for 322487.280027, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck degraded for 322487.284340, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck degraded for 322487.287202, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck degraded for 322487.285604, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck degraded for 322487.287207, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck degraded for 331351.476209, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck degraded for 331351.475746, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck degraded for 322487.280348, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck degraded for 322487.286386, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck degraded for 322487.280066, current state
active+undersized+degraded, last acting [43,1]
pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]
recovery 3327/195138 objects degraded (1.705%)
pool .users pg_num 512 > pgp_num 8
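
To make the listing above easier to read, here is a minimal Python sketch that tallies the nine "pg ... is active+undersized+degraded, acting [...]" summary lines by pool id and by OSD. The function name `summarise` and the regular expression are my own illustration, not part of any Ceph tooling, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches only the short summary lines, for example:
#   pg 5.72 is active+undersized+degraded, acting [16,49]
# The longer "is stuck ... for ..." lines are deliberately not matched.
PG_LINE = re.compile(r"^pg (\d+)\.([0-9a-f]+) is (\S+), acting \[([\d,]+)\]$")


def summarise(lines):
    """Return (PG count per pool id, occurrence count per OSD id)."""
    per_pool = Counter()
    per_osd = Counter()
    for line in lines:
        match = PG_LINE.match(line.strip())
        if not match:
            continue
        pool_id, _pg, _state, acting = match.groups()
        per_pool[int(pool_id)] += 1
        for osd in acting.split(","):
            per_osd[int(osd)] += 1
    return per_pool, per_osd


# Summary lines copied from the "ceph health detail" output above.
sample = """\
pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]
""".splitlines()

per_pool, per_osd = summarise(sample)
print("PGs per pool id:", dict(per_pool))
print("OSDs by number of affected PGs:", per_osd.most_common())
```

Every acting set has only two OSDs, and six of the nine PGs are in pool id 5. The tally cannot say which OSD is missing from each set, only which ones remain.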

My CRUSH map is the default one.

ceph.conf is:

[osd]
osd mkfs type=xfs
osd recovery threads=2
osd disk thread ioprio class=idle
osd disk thread ioprio priority=7
osd journal=/var/lib/ceph/osd/ceph-$id/journal
filestore flusher=False
osd op num shards=3
debug osd=5
osd disk threads=2
osd data=/var/lib/ceph/osd/ceph-$id
osd op num threads per shard=5
osd op threads=4
keyring=/var/lib/ceph/osd/ceph-$id/keyring
osd journal size=4096

[global]
filestore max sync interval=10
auth cluster required=cephx
osd pool default min size=3
osd pool default size=3
public network=10.140.13.0/26
objecter inflight op_bytes=1073741824
auth service required=cephx
filestore min sync interval=1
fsid=fac04d85-db48-4564-b821-deebda046261
keyring=/etc/ceph/keyring
cluster network=10.140.13.0/26
auth client required=cephx
filestore xattr use omap=True
max open files=65536
objecter inflight ops=2048
osd pool default pg num=512
log to syslog = true
#err to syslog = true
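
For cross-checking the pool defaults in the file above against the per-pool values mentioned earlier, here is a small Python sketch using the standard `configparser` module. Only a few keys from [global] are reproduced, and the `normalise` helper is a hypothetical convenience of mine: it relies on Ceph treating spaces, hyphens and underscores in option names as equivalent. As far as I understand, these defaults are applied when a pool is created, so an existing pool can report different values from `ceph osd pool get <pool> min_size`.

```python
import configparser

# Excerpt of the [global] section quoted above; the other keys are omitted.
CONF = """\
[global]
osd pool default min size=3
osd pool default size=3
osd pool default pg num=512
objecter inflight op_bytes=1073741824
"""


def normalise(key):
    """Map 'osd pool default size' style names to 'osd_pool_default_size'."""
    return key.strip().replace(" ", "_").replace("-", "_")


# interpolation=None keeps values such as paths containing '$id' untouched.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONF)
settings = {normalise(k): v for k, v in parser["global"].items()}

default_size = int(settings["osd_pool_default_size"])
default_min_size = int(settings["osd_pool_default_min_size"])
print("default size:", default_size)
print("default min_size:", default_min_size)
```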

--
Gaurav Bafna
9540631400