Re: Cluster not recovering after OSD daemon is down

The degraded PGs are still mapped to the down OSD and have not been remapped to a new OSD. Removing the OSD would likely result in a full recovery.
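
For reference, a minimal removal sequence would look something like the following, where <id> stands for the number of the down OSD (substitute the real id):

# mark the OSD out so its PGs get remapped to the remaining OSDs
ceph osd out <id>
# remove it from the CRUSH map, delete its key, and remove it from the OSD map
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>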

As a note, having two monitors (or any even number of monitors) is not recommended. If either monitor goes down, you will lose quorum. The recommended number of monitors for any cluster is at least three.
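
If a third monitor host is available, it can be added with ceph-deploy (assuming the cluster was deployed that way; "mon3" below is just a placeholder hostname):

# run from the deploy/admin node
ceph-deploy mon add mon3
# confirm all three monitors have formed quorum
ceph quorum_status --format json-pretty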

On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
Hi Cephers,

I am running a very small cluster of 3 storage and 2 monitor nodes.

After I kill one OSD daemon, the cluster never recovers fully: 9 PGs remain undersized for no apparent reason.

After I restart that OSD daemon, the cluster recovers in no time.

The size of all pools is 3 and min_size is 2.

Can anybody please help?
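
For reference, the per-pool replication settings and the state of one of the stuck PGs (7.a, taken from the health output below) can be inspected with something like:

# show size and min_size for every pool
ceph osd dump | grep 'replicated size'
# show the up and acting sets and recovery state of a single PG
ceph pg 7.a query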

Output of "ceph -s":
    cluster fac04d85-db48-4564-b821-deebda046261
     health HEALTH_WARN
            9 pgs degraded
            9 pgs stuck degraded
            9 pgs stuck unclean
            9 pgs stuck undersized
            9 pgs undersized
            recovery 3327/195138 objects degraded (1.705%)
            pool .users pg_num 512 > pgp_num 8
     monmap e2: 2 mons at
{dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
            election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
     osdmap e857: 69 osds: 68 up, 68 in
      pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
            279 GB used, 247 TB / 247 TB avail
            3327/195138 objects degraded (1.705%)
                 887 active+clean
                   9 active+undersized+degraded
  client io 395 B/s rd, 0 B/s wr, 0 op/s

Output of "ceph health detail":

HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
degraded (1.705%); pool .users pg_num 512 > pgp_num 8
pg 7.a is stuck unclean for 322742.938959, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck unclean for 322754.823455, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck unclean for 322750.685684, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck unclean for 322732.665345, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck unclean for 331869.103538, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck unclean for 331871.208948, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck unclean for 331822.771240, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck unclean for 323021.274535, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck unclean for 323007.574395, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck undersized for 322487.284302, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck undersized for 322487.287164, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck undersized for 322487.285566, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck undersized for 322487.287168, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck undersized for 331351.476170, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck undersized for 331351.475707, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck undersized for 322487.280309, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck undersized for 322487.286347, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck undersized for 322487.280027, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck degraded for 322487.284340, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck degraded for 322487.287202, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck degraded for 322487.285604, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck degraded for 322487.287207, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck degraded for 331351.476209, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck degraded for 331351.475746, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck degraded for 322487.280348, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck degraded for 322487.286386, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck degraded for 322487.280066, current state
active+undersized+degraded, last acting [43,1]
pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]
recovery 3327/195138 objects degraded (1.705%)
pool .users pg_num 512 > pgp_num 8
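
For reference, the "pool .users pg_num 512 > pgp_num 8" warning is separate from the down OSD; it can be cleared by raising pgp_num to match pg_num, roughly:

# redistribute the .users PGs across 512 placement targets
ceph osd pool set .users pgp_num 512

Note that this triggers some data movement while the PGs are rebalanced.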


My CRUSH map is the default.
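
For reference, the CRUSH map can be dumped and decompiled to verify this (the /tmp paths are just examples):

# extract the compiled CRUSH map and decompile it to readable text
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
# or inspect the bucket hierarchy directly
ceph osd tree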

ceph.conf is:

[osd]
osd mkfs type=xfs
osd recovery threads=2
osd disk thread ioprio class=idle
osd disk thread ioprio priority=7
osd journal=/var/lib/ceph/osd/ceph-$id/journal
filestore flusher=False
osd op num shards=3
debug osd=5
osd disk threads=2
osd data=
osd op num threads per shard=5
osd op threads=4
keyring=/var/lib/ceph/osd/ceph-$id/keyring
osd journal size=4096


[global]
filestore max sync interval=10
auth cluster required=cephx
osd pool default min size=3
osd pool default size=3
public network=10.140.13.0/26
objecter inflight op_bytes=1073741824
auth service required=cephx
filestore min sync interval=1
fsid=fac04d85-db48-4564-b821-deebda046261
keyring=/etc/ceph/keyring
cluster network=10.140.13.0/26
auth client required=cephx
filestore xattr use omap=True
max open files=65536
objecter inflight ops=2048
osd pool default pg num=512
log to syslog = true
#err to syslog = true


--
Gaurav Bafna
9540631400



--

Thanks,
Tupper Cole
Senior Storage Consultant
Global Storage Consulting, Red Hat
phone:  + 01 919-720-2612
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
