Thanks, Tupper, for replying. Shouldn't the PGs be remapped to other OSDs? Yes, removing the OSD from the cluster does result in a full recovery, but that should not be needed, right?

On Tue, May 3, 2016 at 6:31 PM, Tupper Cole <tcole@xxxxxxxxxx> wrote:
> The degraded PGs are mapped to the down OSD and have not been remapped to
> a new OSD. Removing the OSD would likely result in a full recovery.
>
> As a note, having two monitors (or any even number of monitors) is not
> recommended. If either monitor goes down you will lose quorum. The
> recommended number of monitors for any cluster is at least three.
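For reference, the usual sequence for removing a dead OSD is sketched below. Here <id> is a placeholder for the real OSD id, which is not shown in this thread; substitute it before running anything.

    # mark the OSD out so its PGs are remapped and backfilled elsewhere
    ceph osd out <id>
    # remove it from the CRUSH map, delete its authentication key,
    # and deregister it from the cluster
    ceph osd crush remove osd.<id>
    ceph auth del osd.<id>
    ceph osd rm <id>

A third monitor can be added with "ceph-deploy mon add <host>" (or the equivalent manual ceph-mon bootstrap) to get back to an odd-sized quorum.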
> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> Hi Cephers,
>>
>> I am running a very small cluster of 3 storage and 2 monitor nodes.
>>
>> After I kill one OSD daemon, the cluster never recovers fully: 9 PGs
>> remain undersized for no apparent reason.
>>
>> After I restart that one OSD daemon, the cluster recovers in no time.
>>
>> The size of all pools is 3 and min_size is 2.
>>
>> Can anybody please help?
>>
>> Output of "ceph -s":
>>
>>     cluster fac04d85-db48-4564-b821-deebda046261
>>      health HEALTH_WARN
>>             9 pgs degraded
>>             9 pgs stuck degraded
>>             9 pgs stuck unclean
>>             9 pgs stuck undersized
>>             9 pgs undersized
>>             recovery 3327/195138 objects degraded (1.705%)
>>             pool .users pg_num 512 > pgp_num 8
>>      monmap e2: 2 mons at {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
>>             election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>>      osdmap e857: 69 osds: 68 up, 68 in
>>       pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
>>             279 GB used, 247 TB / 247 TB avail
>>             3327/195138 objects degraded (1.705%)
>>                  887 active+clean
>>                    9 active+undersized+degraded
>>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>>
>> "ceph health detail" output:
>>
>> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean; 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects degraded (1.705%); pool .users pg_num 512 > pgp_num 8
>> pg 7.a is stuck unclean for 322742.938959, current state active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck unclean for 322754.823455, current state active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck unclean for 322750.685684, current state active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck unclean for 322732.665345, current state active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck unclean for 331869.103538, current state active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck unclean for 331871.208948, current state active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck unclean for 331822.771240, current state active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck unclean for 323021.274535, current state active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck unclean for 323007.574395, current state active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck undersized for 322487.284302, current state active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck undersized for 322487.287164, current state active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck undersized for 322487.285566, current state active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck undersized for 322487.287168, current state active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck undersized for 331351.476170, current state active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck undersized for 331351.475707, current state active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck undersized for 322487.280309, current state active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck undersized for 322487.286347, current state active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck undersized for 322487.280027, current state active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck degraded for 322487.284340, current state active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck degraded for 322487.287202, current state active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck degraded for 322487.285604, current state active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck degraded for 322487.287207, current state active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck degraded for 331351.476209, current state active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck degraded for 331351.475746, current state active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck degraded for 322487.280348, current state active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck degraded for 322487.286386, current state active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck degraded for 322487.280066, current state active+undersized+degraded, last acting [43,1]
>> pg 5.72 is active+undersized+degraded, acting [16,49]
>> pg 5.4e is active+undersized+degraded, acting [16,38]
>> pg 5.32 is active+undersized+degraded, acting [39,19]
>> pg 5.37 is active+undersized+degraded, acting [43,1]
>> pg 5.2c is active+undersized+degraded, acting [47,18]
>> pg 5.27 is active+undersized+degraded, acting [26,19]
>> pg 6.13 is active+undersized+degraded, acting [30,16]
>> pg 4.17 is active+undersized+degraded, acting [47,20]
>> pg 7.a is active+undersized+degraded, acting [38,2]
>> recovery 3327/195138 objects degraded (1.705%)
>> pool .users pg_num 512 > pgp_num 8
>>
>> My crush map is default.
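Two things in that output are worth chasing before removing anything (a sketch: 7.a is just one of the stuck PGs listed above, and whether CRUSH tunables are the actual culprit here is an assumption to verify):

    # ask the PG itself why a third replica is not being chosen
    ceph pg 7.a query

    # .users has pg_num 512 but pgp_num 8; raising pgp_num to match
    # clears that HEALTH_WARN and lets CRUSH place all 512 PGs
    # independently
    ceph osd pool set .users pgp_num 512

    # with a default crush map, legacy tunables can fail to pick a
    # replacement OSD after a failure; inspect, and update only if
    # the data movement this triggers is acceptable
    ceph osd crush show-tunables
    ceph osd crush tunables optimal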
>>
>> Ceph.conf is:
>>
>> [osd]
>> osd mkfs type=xfs
>> osd recovery threads=2
>> osd disk thread ioprio class=idle
>> osd disk thread ioprio priority=7
>> osd journal=/var/lib/ceph/osd/ceph-$id/journal
>> filestore flusher=False
>> osd op num shards=3
>> debug osd=5
>> osd disk threads=2
>> osd data=/var/lib/ceph/osd/ceph-$id
>> osd op num threads per shard=5
>> osd op threads=4
>> keyring=/var/lib/ceph/osd/ceph-$id/keyring
>> osd journal size=4096
>>
>> [global]
>> filestore max sync interval=10
>> auth cluster required=cephx
>> osd pool default min size=3
>> osd pool default size=3
>> public network=10.140.13.0/26
>> objecter inflight op_bytes=1073741824
>> auth service required=cephx
>> filestore min sync interval=1
>> fsid=fac04d85-db48-4564-b821-deebda046261
>> keyring=/etc/ceph/keyring
>> cluster network=10.140.13.0/26
>> auth client required=cephx
>> filestore xattr use omap=True
>> max open files=65536
>> objecter inflight ops=2048
>> osd pool default pg num=512
>> log to syslog = true
>> #err to syslog = true
>
> --
> Thanks,
> Tupper Cole
> Senior Storage Consultant
> Global Storage Consulting, Red Hat
> tcole@xxxxxxxxxx
> phone: +01 919-720-2612

--
Gaurav Bafna
9540631400
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com