Re: Cluster health_warn 1 active+undersized+degraded/1 active+remapped

I decided to set OSD 76 out, let the cluster shuffle the data off
that disk, and then bring the OSD back in (the exact commands are
noted after the output below). For the most part this seemed to be
working, but then I had 1 object degraded and 88xxx objects
misplaced:

# ceph health detail
HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
(0.000%); recovery 88844/66089446 objects misplaced (0.134%)
pg 2.e7f is stuck unclean for 88398.251351, current state
active+remapped, last acting [58,5]
pg 2.143 is stuck unclean for 13892.364101, current state
active+remapped, last acting [16,76]
pg 2.968 is stuck unclean for 13892.363521, current state
active+remapped, last acting [44,76]
pg 2.5f8 is stuck unclean for 13892.377245, current state
active+remapped, last acting [17,76]
pg 2.81c is stuck unclean for 13892.363443, current state
active+remapped, last acting [25,76]
pg 2.1a3 is stuck unclean for 13892.364400, current state
active+remapped, last acting [16,76]
pg 2.2cb is stuck unclean for 13892.374390, current state
active+remapped, last acting [14,76]
pg 2.d41 is stuck unclean for 13892.373636, current state
active+remapped, last acting [27,76]
pg 2.3f9 is stuck unclean for 13892.373147, current state
active+remapped, last acting [35,76]
pg 2.a62 is stuck unclean for 86283.741920, current state
active+remapped, last acting [2,38]
pg 2.1b0 is stuck unclean for 13892.363268, current state
active+remapped, last acting [3,76]
recovery 1/66089446 objects degraded (0.000%)
recovery 88844/66089446 objects misplaced (0.134%)
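
(For reference, the out/in itself was nothing exotic, just the
standard commands along these lines, with a wait for recovery to
settle in between:)

# ceph osd out 76
  ... wait for recovery/backfill to finish, e.g. by watching "ceph -w" ...
# ceph osd in 76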

I say "seemed" because, despite the one degraded object, none of the
pgs show as degraded:
# ceph pg dump_stuck degraded
ok

# ceph pg dump_stuck unclean
ok
pg_stat  state            up    up_primary  acting   acting_primary
2.e7f    active+remapped  [58]  58          [58,5]   58
2.143    active+remapped  [16]  16          [16,76]  16
2.968    active+remapped  [44]  44          [44,76]  44
2.5f8    active+remapped  [17]  17          [17,76]  17
2.81c    active+remapped  [25]  25          [25,76]  25
2.1a3    active+remapped  [16]  16          [16,76]  16
2.2cb    active+remapped  [14]  14          [14,76]  14
2.d41    active+remapped  [27]  27          [27,76]  27
2.3f9    active+remapped  [35]  35          [35,76]  35
2.a62    active+remapped  [2]   2           [2,38]   2
2.1b0    active+remapped  [3]   3           [3,76]   3

All of the OSD filesystems are below 85% full.
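
(For what it's worth, that is easy to double-check with "ceph osd df",
or with plain df on the OSD hosts, assuming the default mount points:)

# ceph osd df
# df -h /var/lib/ceph/osd/ceph-*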

I then compared against a fresh 0.94.2 cluster that had never been
upgraded (the current cluster is also on 0.94.2, but has been
upgraded a couple of times) and noticed its crush map had 'tunable
straw_calc_version 1', so I added that tunable to the current
cluster's crush map.
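
(In case it helps someone reproduce this, adding the tunable was the
usual export/edit/import cycle on the crush map, something along
these lines; the file names are just placeholders:)

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt
  ... add "tunable straw_calc_version 1" alongside the other tunables ...
# crushtool -c crush.txt -o crush.new
# ceph osd setcrushmap -i crush.new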

After the data moved around for 8 hours or so, I'm left with this state:

# ceph health detail
HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
misplaced (0.025%)
pg 2.e7f is stuck unclean for 149422.331848, current state
active+remapped, last acting [58,5]
pg 2.782 is stuck unclean for 64878.002464, current state
active+remapped, last acting [76,31]
recovery 16357/66089446 objects misplaced (0.025%)

I attempted a pg repair on both of the pgs listed above, but it
doesn't look like anything is happening. The docs reference the
inconsistent state as the use case for the repair command, so that's
likely why.
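
(That is, just the repair suggested in the reply quoted below:)

# ceph pg repair 2.e7f
# ceph pg repair 2.782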

These 2 pgs have been the issue throughout this process, so how can I
dig deeper to figure out what the problem is?

# ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
# ceph pg 2.782 query: http://pastebin.com/0ntBfFK5


On Wed, Aug 12, 2015 at 6:52 PM, yangyongpeng@xxxxxxxxxxxxx
<yangyongpeng@xxxxxxxxxxxxx> wrote:
> You can try "ceph pg repair pg_id" to repair the unhealthy pg. The
> "ceph health detail" command is very useful for detecting unhealthy pgs.
>
> ________________________________
> yangyongpeng@xxxxxxxxxxxxx
>
>
> From: Steve Dainard
> Date: 2015-08-12 23:48
> To: ceph-users
> Subject:  Cluster health_warn 1 active+undersized+degraded/1
> active+remapped
> I ran a ceph osd reweight-by-utilization yesterday and partway through
> had a network interruption. After the network was restored the cluster
> continued to rebalance, but this morning the cluster has stopped
> rebalancing and the status will not change from:
>
> # ceph status
>     cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
>      health HEALTH_WARN
>             1 pgs degraded
>             1 pgs stuck degraded
>             2 pgs stuck unclean
>             1 pgs stuck undersized
>             1 pgs undersized
>             recovery 8163/66089054 objects degraded (0.012%)
>             recovery 8194/66089054 objects misplaced (0.012%)
>      monmap e24: 3 mons at
> {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
>             election epoch 250, quorum 0,1,2 mon1,mon2,mon3
>      osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
>       pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
>             251 TB used, 111 TB / 363 TB avail
>             8163/66089054 objects degraded (0.012%)
>             8194/66089054 objects misplaced (0.012%)
>                 4142 active+clean
>                    1 active+undersized+degraded
>                    1 active+remapped
>
>
> # ceph health detail
> HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 2 pgs stuck unclean;
> 1 pgs stuck undersized; 1 pgs undersized; recovery 8163/66089054
> objects degraded (0.012%); recovery 8194/66089054 objects misplaced
> (0.012%)
> pg 2.e7f is stuck unclean for 65125.554509, current state
> active+remapped, last acting [58,5]
> pg 2.782 is stuck unclean for 65140.681540, current state
> active+undersized+degraded, last acting [76]
> pg 2.782 is stuck undersized for 60568.221461, current state
> active+undersized+degraded, last acting [76]
> pg 2.782 is stuck degraded for 60568.221549, current state
> active+undersized+degraded, last acting [76]
> pg 2.782 is active+undersized+degraded, acting [76]
> recovery 8163/66089054 objects degraded (0.012%)
> recovery 8194/66089054 objects misplaced (0.012%)
>
> # ceph pg 2.e7f query
>     "recovery_state": [
>         {
>             "name": "Started\/Primary\/Active",
>             "enter_time": "2015-08-11 15:43:09.190269",
>             "might_have_unfound": [],
>             "recovery_progress": {
>                 "backfill_targets": [],
>                 "waiting_on_backfill": [],
>                 "last_backfill_started": "0\/\/0\/\/-1",
>                 "backfill_info": {
>                     "begin": "0\/\/0\/\/-1",
>                     "end": "0\/\/0\/\/-1",
>                     "objects": []
>                 },
>                 "peer_backfill_info": [],
>                 "backfills_in_flight": [],
>                 "recovering": [],
>                 "pg_backend": {
>                     "pull_from_peer": [],
>                     "pushing": []
>                 }
>             },
>             "scrub": {
>                 "scrubber.epoch_start": "0",
>                 "scrubber.active": 0,
>                 "scrubber.waiting_on": 0,
>                 "scrubber.waiting_on_whom": []
>             }
>         },
>         {
>             "name": "Started",
>             "enter_time": "2015-08-11 15:43:04.955796"
>         }
>     ],
>
>
> # ceph pg 2.782 query
>   "recovery_state": [
>         {
>             "name": "Started\/Primary\/Active",
>             "enter_time": "2015-08-11 15:42:42.178042",
>             "might_have_unfound": [
>                 {
>                     "osd": "5",
>                     "status": "not queried"
>                 }
>             ],
>             "recovery_progress": {
>                 "backfill_targets": [],
>                 "waiting_on_backfill": [],
>                 "last_backfill_started": "0\/\/0\/\/-1",
>                 "backfill_info": {
>                     "begin": "0\/\/0\/\/-1",
>                     "end": "0\/\/0\/\/-1",
>                     "objects": []
>                 },
>                 "peer_backfill_info": [],
>                 "backfills_in_flight": [],
>                 "recovering": [],
>                 "pg_backend": {
>                     "pull_from_peer": [],
>                     "pushing": []
>                 }
>             },
>             "scrub": {
>                 "scrubber.epoch_start": "0",
>                 "scrubber.active": 0,
>                 "scrubber.waiting_on": 0,
>                 "scrubber.waiting_on_whom": []
>             }
>         },
>         {
>             "name": "Started",
>             "enter_time": "2015-08-11 15:42:41.139709"
>         }
>     ],
>     "agent_state": {}
>
> I tried restarting osd.5/58/76 but no change.
>
> Any suggestions?
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


