I don't see anything obvious, sorry. It looks like something involving osd.{5, 76, 38}, which are absent from the *up* set even though they are up. How about increasing the log level ('debug_osd = 20') on osd.76 and restarting the OSD?

Thanks,
Guang
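
A minimal sketch of that change, assuming the Hammer (0.94.x) tooling in use in this thread; 'service ceph' is the sysvinit wrapper and may differ under other init systems:

# ceph tell osd.76 injectargs '--debug-osd 20/20'

This takes effect immediately but is lost on restart; to make it persist, add it to ceph.conf on the OSD host before restarting:

[osd.76]
debug_osd = 20

# service ceph restart osd.76
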
----------------------------------------
> Date: Thu, 13 Aug 2015 09:10:31 -0700
> Subject: Re: Cluster health_warn 1 active+undersized+degraded/1 active+remapped
> From: sdainard@xxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: yangyongpeng@xxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>
> OSD tree: http://pastebin.com/3z333DP4
> Crushmap: http://pastebin.com/DBd9k56m
>
> I realize these nodes are quite large; I have plans to break them out into 12 OSDs/node.
>
> On Thu, Aug 13, 2015 at 9:02 AM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
>> Could you share the 'ceph osd tree' dump and the CRUSH map dump?
>>
>> Thanks,
>> Guang
>>
>>
>> ----------------------------------------
>>> Date: Thu, 13 Aug 2015 08:16:09 -0700
>>> From: sdainard@xxxxxxxx
>>> To: yangyongpeng@xxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: Cluster health_warn 1 active+undersized+degraded/1 active+remapped
>>>
>>> I decided to set OSD 76 out and let the cluster shuffle the data off that disk, then brought the OSD back in. For the most part this apparently worked, but then I had 1 object degraded and 88xxx objects misplaced:
>>>
>>> # ceph health detail
>>> HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
>>> pg 2.e7f is stuck unclean for 88398.251351, current state active+remapped, last acting [58,5]
>>> pg 2.143 is stuck unclean for 13892.364101, current state active+remapped, last acting [16,76]
>>> pg 2.968 is stuck unclean for 13892.363521, current state active+remapped, last acting [44,76]
>>> pg 2.5f8 is stuck unclean for 13892.377245, current state active+remapped, last acting [17,76]
>>> pg 2.81c is stuck unclean for 13892.363443, current state active+remapped, last acting [25,76]
>>> pg 2.1a3 is stuck unclean for 13892.364400, current state active+remapped, last acting [16,76]
>>> pg 2.2cb is stuck unclean for 13892.374390, current state active+remapped, last acting [14,76]
>>> pg 2.d41 is stuck unclean for 13892.373636, current state active+remapped, last acting [27,76]
>>> pg 2.3f9 is stuck unclean for 13892.373147, current state active+remapped, last acting [35,76]
>>> pg 2.a62 is stuck unclean for 86283.741920, current state active+remapped, last acting [2,38]
>>> pg 2.1b0 is stuck unclean for 13892.363268, current state active+remapped, last acting [3,76]
>>> recovery 1/66089446 objects degraded (0.000%)
>>> recovery 88844/66089446 objects misplaced (0.134%)
>>>
>>> I say "apparently" because, even with one object degraded, none of the PGs show as degraded:
>>> # ceph pg dump_stuck degraded
>>> ok
>>>
>>> # ceph pg dump_stuck unclean
>>> ok
>>> pg_stat  state            up    up_primary  acting   acting_primary
>>> 2.e7f    active+remapped  [58]  58          [58,5]   58
>>> 2.143    active+remapped  [16]  16          [16,76]  16
>>> 2.968    active+remapped  [44]  44          [44,76]  44
>>> 2.5f8    active+remapped  [17]  17          [17,76]  17
>>> 2.81c    active+remapped  [25]  25          [25,76]  25
>>> 2.1a3    active+remapped  [16]  16          [16,76]  16
>>> 2.2cb    active+remapped  [14]  14          [14,76]  14
>>> 2.d41    active+remapped  [27]  27          [27,76]  27
>>> 2.3f9    active+remapped  [35]  35          [35,76]  35
>>> 2.a62    active+remapped  [2]   2           [2,38]   2
>>> 2.1b0    active+remapped  [3]   3           [3,76]   3
>>>
>>> All of the OSD filesystems are below 85% full.
>>>
>>> I then compared against a new 0.94.2 cluster that had never been updated (the current cluster is 0.94.2 but has been updated a couple of times) and noticed its CRUSH map had 'tunable straw_calc_version 1', so I added it to the current cluster.
>>>
>>> After the data moved around for about 8 hours or so I'm left with this state:
>>>
>>> # ceph health detail
>>> HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects misplaced (0.025%)
>>> pg 2.e7f is stuck unclean for 149422.331848, current state active+remapped, last acting [58,5]
>>> pg 2.782 is stuck unclean for 64878.002464, current state active+remapped, last acting [76,31]
>>> recovery 16357/66089446 objects misplaced (0.025%)
>>>
>>> I attempted a pg repair on both of the PGs listed above, but it doesn't look like anything is happening. The docs reference an inconsistent state as the use case for the repair command, so that's likely why.
>>>
>>> These 2 PGs have been the issue throughout this process, so how can I dig deeper to figure out what the problem is?
>>>
>>> # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
>>> # ceph pg 2.782 query: http://pastebin.com/0ntBfFK5
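
For reference, the straw_calc_version tunable mentioned above can be added by round-tripping the CRUSH map through crushtool; a sketch, with file names chosen here purely for illustration (on Hammer, 'ceph osd crush set-tunable straw_calc_version 1' should achieve the same in one step):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
  (add "tunable straw_calc_version 1" to the tunables block at the top of crushmap.txt)
# crushtool -c crushmap.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new

Either way, expect data movement once the new map is injected, as described above.
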
>>> On Wed, Aug 12, 2015 at 6:52 PM, yangyongpeng@xxxxxxxxxxxxx <yangyongpeng@xxxxxxxxxxxxx> wrote:
>>>> You can try "ceph pg repair pg_id" to repair the unhealthy PG. The "ceph health detail" command is very useful for detecting unhealthy PGs.
>>>>
>>>> ________________________________
>>>> yangyongpeng@xxxxxxxxxxxxx
>>>>
>>>>
>>>> From: Steve Dainard
>>>> Date: 2015-08-12 23:48
>>>> To: ceph-users
>>>> Subject: Cluster health_warn 1 active+undersized+degraded/1 active+remapped
>>>>
>>>> I ran a ceph osd reweight-by-utilization yesterday and partway through had a network interruption. After the network was restored the cluster continued to rebalance, but this morning the cluster has stopped rebalancing and the status will not change from:
>>>>
>>>> # ceph status
>>>>     cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
>>>>      health HEALTH_WARN
>>>>             1 pgs degraded
>>>>             1 pgs stuck degraded
>>>>             2 pgs stuck unclean
>>>>             1 pgs stuck undersized
>>>>             1 pgs undersized
>>>>             recovery 8163/66089054 objects degraded (0.012%)
>>>>             recovery 8194/66089054 objects misplaced (0.012%)
>>>>      monmap e24: 3 mons at {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
>>>>             election epoch 250, quorum 0,1,2 mon1,mon2,mon3
>>>>      osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
>>>>       pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
>>>>             251 TB used, 111 TB / 363 TB avail
>>>>             8163/66089054 objects degraded (0.012%)
>>>>             8194/66089054 objects misplaced (0.012%)
>>>>                 4142 active+clean
>>>>                    1 active+undersized+degraded
>>>>                    1 active+remapped
>>>>
>>>>
>>>> # ceph health detail
>>>> HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 2 pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized; recovery 8163/66089054 objects degraded (0.012%); recovery 8194/66089054 objects misplaced (0.012%)
>>>> pg 2.e7f is stuck unclean for 65125.554509, current state active+remapped, last acting [58,5]
>>>> pg 2.782 is stuck unclean for 65140.681540, current state active+undersized+degraded, last acting [76]
>>>> pg 2.782 is stuck undersized for 60568.221461, current state active+undersized+degraded, last acting [76]
>>>> pg 2.782 is stuck degraded for 60568.221549, current state active+undersized+degraded, last acting [76]
>>>> pg 2.782 is active+undersized+degraded, acting [76]
>>>> recovery 8163/66089054 objects degraded (0.012%)
>>>> recovery 8194/66089054 objects misplaced (0.012%)
>>>>
>>>> # ceph pg 2.e7f query
>>>> "recovery_state": [
>>>>     {
>>>>         "name": "Started\/Primary\/Active",
>>>>         "enter_time": "2015-08-11 15:43:09.190269",
>>>>         "might_have_unfound": [],
>>>>         "recovery_progress": {
>>>>             "backfill_targets": [],
>>>>             "waiting_on_backfill": [],
>>>>             "last_backfill_started": "0\/\/0\/\/-1",
>>>>             "backfill_info": {
>>>>                 "begin": "0\/\/0\/\/-1",
>>>>                 "end": "0\/\/0\/\/-1",
>>>>                 "objects": []
>>>>             },
>>>>             "peer_backfill_info": [],
>>>>             "backfills_in_flight": [],
>>>>             "recovering": [],
>>>>             "pg_backend": {
>>>>                 "pull_from_peer": [],
>>>>                 "pushing": []
>>>>             }
>>>>         },
>>>>         "scrub": {
>>>>             "scrubber.epoch_start": "0",
>>>>             "scrubber.active": 0,
>>>>             "scrubber.waiting_on": 0,
>>>>             "scrubber.waiting_on_whom": []
>>>>         }
>>>>     },
>>>>     {
>>>>         "name": "Started",
>>>>         "enter_time": "2015-08-11 15:43:04.955796"
>>>>     }
>>>> ],
>>>>
>>>>
>>>> # ceph pg 2.782 query
>>>> "recovery_state": [
>>>>     {
>>>>         "name": "Started\/Primary\/Active",
>>>>         "enter_time": "2015-08-11 15:42:42.178042",
>>>>         "might_have_unfound": [
>>>>             {
>>>>                 "osd": "5",
>>>>                 "status": "not queried"
>>>>             }
>>>>         ],
>>>>         "recovery_progress": {
>>>>             "backfill_targets": [],
>>>>             "waiting_on_backfill": [],
>>>>             "last_backfill_started": "0\/\/0\/\/-1",
>>>>             "backfill_info": {
>>>>                 "begin": "0\/\/0\/\/-1",
>>>>                 "end": "0\/\/0\/\/-1",
>>>>                 "objects": []
>>>>             },
>>>>             "peer_backfill_info": [],
>>>>             "backfills_in_flight": [],
>>>>             "recovering": [],
>>>>             "pg_backend": {
>>>>                 "pull_from_peer": [],
>>>>                 "pushing": []
>>>>             }
>>>>         },
>>>>         "scrub": {
>>>>             "scrubber.epoch_start": "0",
>>>>             "scrubber.active": 0,
>>>>             "scrubber.waiting_on": 0,
>>>>             "scrubber.waiting_on_whom": []
>>>>         }
>>>>     },
>>>>     {
>>>>         "name": "Started",
>>>>         "enter_time": "2015-08-11 15:42:41.139709"
>>>>     }
>>>> ],
>>>> "agent_state": {}
>>>>
>>>> I tried restarting osd.5/58/76 but no change.
>>>>
>>>> Any suggestions?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
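
One way to dig further into PGs like 2.e7f and 2.782, whose up set contains a single OSD while the acting set holds two, is to compare the two sets directly and test the CRUSH map offline for bad mappings. A sketch; the rule id and replica count below are illustrative and should be matched to the pool's crush_ruleset and size:

# ceph pg map 2.e7f
# ceph pg map 2.782
# ceph osd getcrushmap -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings

If crushtool reports bad mappings for the pool's rule, CRUSH itself cannot find enough OSDs for those inputs (typically a tunables, weight, or hierarchy issue), which would explain why these PGs stay active+remapped regardless of repair attempts.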