Also, the old PGs are not mapped to the down OSD, as seen from ceph health detail:

pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]

From the pg query of 7.a:

{
    "state": "active+undersized+degraded",
    "snap_trimq": "[]",
    "epoch": 857,
    "up": [
        38,
        2
    ],
    "acting": [
        38,
        2
    ],
    "actingbackfill": [
        "2",
        "38"
    ],
    "info": {
        "pgid": "7.a",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 13,
            "last_epoch_started": 818,
            "last_epoch_clean": 818,
            "last_epoch_split": 0,
            "same_up_since": 817,
            "same_interval_since": 817,

Complete pg query info at: http://pastebin.com/ZHB6M4PQ
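For reference, the JSON above comes from querying the pg directly; roughly (jq is optional, just convenient if you have it):

    # full pg state, including peering history
    ceph pg 7.a query

    # only the up and acting sets
    ceph pg 7.a query | jq '{up: .up, acting: .acting}'

Note that both "up" and "acting" contain just two OSDs even though the pool size is 3, so CRUSH is not selecting a third replica for these PGs at all.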
On Tue, May 3, 2016 at 6:46 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
> Thanks Tupper for replying.
>
> Shouldn't the PG be remapped to other OSDs?
>
> Yes, removing the OSD from the cluster does result in a full recovery.
> But that should not be needed, right?
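To be explicit, "removing the OSD from the cluster" above means the usual full removal sequence, roughly the following (assuming the failed daemon is osd.N):

    ceph osd out N
    ceph osd crush remove osd.N
    ceph auth del osd.N
    ceph osd rm N

Only once the OSD is gone from the CRUSH map do these 9 PGs remap and recover; the failure by itself never triggers it.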
> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole <tcole@xxxxxxxxxx> wrote:
>> The degraded pgs are mapped to the down OSD and have not mapped to a new
>> OSD. Removing the OSD would likely result in a full recovery.
>>
>> As a note, having two monitors (or any even number of monitors) is not
>> recommended. If either monitor goes down you will lose quorum. The
>> recommended number of monitors for any cluster is at least three.
>>
>> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>> Hi Cephers,
>>>
>>> I am running a very small cluster of 3 storage and 2 monitor nodes.
>>>
>>> After I kill one OSD daemon, the cluster never recovers fully. 9 PGs
>>> remain undersized for an unknown reason.
>>>
>>> After I restart that OSD daemon, the cluster recovers in no time.
>>>
>>> Size of all pools is 3 and min_size is 2.
>>>
>>> Can anybody please help?
>>>
>>> Output of "ceph -s":
>>>
>>>     cluster fac04d85-db48-4564-b821-deebda046261
>>>      health HEALTH_WARN
>>>             9 pgs degraded
>>>             9 pgs stuck degraded
>>>             9 pgs stuck unclean
>>>             9 pgs stuck undersized
>>>             9 pgs undersized
>>>             recovery 3327/195138 objects degraded (1.705%)
>>>             pool .users pg_num 512 > pgp_num 8
>>>      monmap e2: 2 mons at {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
>>>             election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>>>      osdmap e857: 69 osds: 68 up, 68 in
>>>       pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
>>>             279 GB used, 247 TB / 247 TB avail
>>>             3327/195138 objects degraded (1.705%)
>>>                  887 active+clean
>>>                    9 active+undersized+degraded
>>>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>>>
>>> ceph health detail output:
>>>
>>> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
>>> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
>>> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
>>> pg 7.a is stuck unclean for 322742.938959, current state active+undersized+degraded, last acting [38,2]
>>> pg 5.27 is stuck unclean for 322754.823455, current state active+undersized+degraded, last acting [26,19]
>>> pg 5.32 is stuck unclean for 322750.685684, current state active+undersized+degraded, last acting [39,19]
>>> pg 6.13 is stuck unclean for 322732.665345, current state active+undersized+degraded, last acting [30,16]
>>> pg 5.4e is stuck unclean for 331869.103538, current state active+undersized+degraded, last acting [16,38]
>>> pg 5.72 is stuck unclean for 331871.208948, current state active+undersized+degraded, last acting [16,49]
>>> pg 4.17 is stuck unclean for 331822.771240, current state active+undersized+degraded, last acting [47,20]
>>> pg 5.2c is stuck unclean for 323021.274535, current state active+undersized+degraded, last acting [47,18]
>>> pg 5.37 is stuck unclean for 323007.574395, current state active+undersized+degraded, last acting [43,1]
>>> pg 7.a is stuck undersized for 322487.284302, current state active+undersized+degraded, last acting [38,2]
>>> pg 5.27 is stuck undersized for 322487.287164, current state active+undersized+degraded, last acting [26,19]
>>> pg 5.32 is stuck undersized for 322487.285566, current state active+undersized+degraded, last acting [39,19]
>>> pg 6.13 is stuck undersized for 322487.287168, current state active+undersized+degraded, last acting [30,16]
>>> pg 5.4e is stuck undersized for 331351.476170, current state active+undersized+degraded, last acting [16,38]
>>> pg 5.72 is stuck undersized for 331351.475707, current state active+undersized+degraded, last acting [16,49]
>>> pg 4.17 is stuck undersized for 322487.280309, current state active+undersized+degraded, last acting [47,20]
>>> pg 5.2c is stuck undersized for 322487.286347, current state active+undersized+degraded, last acting [47,18]
>>> pg 5.37 is stuck undersized for 322487.280027, current state active+undersized+degraded, last acting [43,1]
>>> pg 7.a is stuck degraded for 322487.284340, current state active+undersized+degraded, last acting [38,2]
>>> pg 5.27 is stuck degraded for 322487.287202, current state active+undersized+degraded, last acting [26,19]
>>> pg 5.32 is stuck degraded for 322487.285604, current state active+undersized+degraded, last acting [39,19]
>>> pg 6.13 is stuck degraded for 322487.287207, current state active+undersized+degraded, last acting [30,16]
>>> pg 5.4e is stuck degraded for 331351.476209, current state active+undersized+degraded, last acting [16,38]
>>> pg 5.72 is stuck degraded for 331351.475746, current state active+undersized+degraded, last acting [16,49]
>>> pg 4.17 is stuck degraded for 322487.280348, current state active+undersized+degraded, last acting [47,20]
>>> pg 5.2c is stuck degraded for 322487.286386, current state active+undersized+degraded, last acting [47,18]
>>> pg 5.37 is stuck degraded for 322487.280066, current state active+undersized+degraded, last acting [43,1]
>>> pg 5.72 is active+undersized+degraded, acting [16,49]
>>> pg 5.4e is active+undersized+degraded, acting [16,38]
>>> pg 5.32 is active+undersized+degraded, acting [39,19]
>>> pg 5.37 is active+undersized+degraded, acting [43,1]
>>> pg 5.2c is active+undersized+degraded, acting [47,18]
>>> pg 5.27 is active+undersized+degraded, acting [26,19]
>>> pg 6.13 is active+undersized+degraded, acting [30,16]
>>> pg 4.17 is active+undersized+degraded, acting [47,20]
>>> pg 7.a is active+undersized+degraded, acting [38,2]
>>> recovery 3327/195138 objects degraded (1.705%)
>>> pool .users pg_num 512 > pgp_num 8
>>>
>>> My crush map is the default.
>>>
>>> Ceph.conf is:
>>>
>>> [osd]
>>> osd mkfs type=xfs
>>> osd recovery threads=2
>>> osd disk thread ioprio class=idle
>>> osd disk thread ioprio priority=7
>>> osd journal=/var/lib/ceph/osd/ceph-$id/journal
>>> filestore flusher=False
>>> osd op num shards=3
>>> debug osd=5
>>> osd disk threads=2
>>> osd data=/var/lib/ceph/osd/ceph-$id
>>> osd op num threads per shard=5
>>> osd op threads=4
>>> keyring=/var/lib/ceph/osd/ceph-$id/keyring
>>> osd journal size=4096
>>>
>>> [global]
>>> filestore max sync interval=10
>>> auth cluster required=cephx
>>> osd pool default min size=3
>>> osd pool default size=3
>>> public network=10.140.13.0/26
>>> objecter inflight op_bytes=1073741824
>>> auth service required=cephx
>>> filestore min sync interval=1
>>> fsid=fac04d85-db48-4564-b821-deebda046261
>>> keyring=/etc/ceph/keyring
>>> cluster network=10.140.13.0/26
>>> auth client required=cephx
>>> filestore xattr use omap=True
>>> max open files=65536
>>> objecter inflight ops=2048
>>> osd pool default pg num=512
>>> log to syslog = true
>>> #err to syslog = true
>>>
>>> --
>>> Gaurav Bafna
>>> 9540631400
>>
>> --
>> Thanks,
>> Tupper Cole
>> Senior Storage Consultant
>> Global Storage Consulting, Red Hat
>> tcole@xxxxxxxxxx
>> phone: +01 919-720-2612
>
> --
> Gaurav Bafna
> 9540631400

--
Gaurav Bafna
9540631400

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com