Hi all. I have a Ceph cluster that's partially upgraded to Luminous. Last night a host died, and since then the cluster has been failing to recover. It finished backfilling, but was left with thousands of blocked requests and PGs that are degraded, inactive, or stale. To try to move past the issue, I set the noout, noscrub, and nodeep-scrub flags and restarted all services one by one.
Here is the current state of the cluster. Any idea how to get past the stale and stuck PGs? Any help would be much appreciated. Thanks.
-Brett
## ceph -s output
###############
$ sudo ceph -s
  cluster:
    id:     <removed>
    health: HEALTH_ERR
            165 pgs are stuck inactive for more than 60 seconds
            243 pgs backfill_wait
            144 pgs backfilling
            332 pgs degraded
            5 pgs peering
            1 pgs recovery_wait
            22 pgs stale
            332 pgs stuck degraded
            143 pgs stuck inactive
            22 pgs stuck stale
            531 pgs stuck unclean
            330 pgs stuck undersized
            330 pgs undersized
            671 requests are blocked > 32 sec
            603 requests are blocked > 4096 sec
            recovery 3524906/412016682 objects degraded (0.856%)
            recovery 2462252/412016682 objects misplaced (0.598%)
            noout,noscrub,nodeep-scrub flag(s) set
            mon.ceph0rdi-mon1-1-prd store is getting too big! 17612 MB >= 15360 MB
            mon.ceph0rdi-mon2-1-prd store is getting too big! 17669 MB >= 15360 MB
            mon.ceph0rdi-mon3-1-prd store is getting too big! 17586 MB >= 15360 MB
  services:
    mon: 3 daemons, quorum ceph0rdi-mon1-1-prd,ceph0rdi-mon2-1-prd,ceph0rdi-mon3-1-prd
    mgr: ceph0rdi-mon3-1-prd(active), standbys: ceph0rdi-mon2-1-prd, ceph0rdi-mon1-1-prd
    osd: 222 osds: 218 up, 218 in; 428 remapped pgs
         flags noout,noscrub,nodeep-scrub
  data:
    pools:   35 pools, 38144 pgs
    objects: 130M objects, 172 TB
    usage:   538 TB used, 337 TB / 875 TB avail
    pgs:     0.375% pgs not active
             3524906/412016682 objects degraded (0.856%)
             2462252/412016682 objects misplaced (0.598%)
             37599 active+clean
             173   active+undersized+degraded+remapped+backfill_wait
             133   active+undersized+degraded+remapped+backfilling
             93    activating
             68    active+remapped+backfill_wait
             22    activating+undersized+degraded+remapped
             13    stale+active+clean
             11    active+remapped+backfilling
             9     activating+remapped
             5     remapped
             5     stale+activating+remapped
             3     remapped+peering
             2     stale+remapped
             2     stale+remapped+peering
             1     activating+degraded+remapped
             1     active+clean+remapped
             1     active+degraded+remapped+backfill_wait
             1     active+undersized+remapped+backfill_wait
             1     activating+degraded
             1     active+recovery_wait+undersized+degraded+remapped
  io:
    client:   187 kB/s rd, 2595 kB/s wr, 99 op/s rd, 343 op/s wr
    recovery: 1509 MB/s, 1541 objects/s
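As a sanity check, the degraded and misplaced percentages above follow directly from the raw object counts in the same output. A quick sketch (using only the numbers shown; the denominator appears to count object copies across replicas, which is an assumption on my part):

```python
# Recompute the percentages reported by `ceph -s` from the raw counts above.
degraded_objects = 3_524_906
misplaced_objects = 2_462_252
total_object_copies = 412_016_682  # assumed: objects x replication, all pools

degraded_pct = round(degraded_objects / total_object_copies * 100, 3)
misplaced_pct = round(misplaced_objects / total_object_copies * 100, 3)

print(f"degraded:  {degraded_pct}%")   # matches the 0.856% in the output
print(f"misplaced: {misplaced_pct}%")  # matches the 0.598% in the output
```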
## ceph pg dump_stuck stale (the number of stale PGs doesn't seem to decrease)
########################################################
$ sudo ceph pg dump_stuck stale
ok
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
17.6d7 stale+remapped [5,223,96] 5 [223,96,148] 223
2.5c5 stale+active+clean [224,48,179] 224 [224,48,179] 224
17.64e stale+active+clean [224,84,109] 224 [224,84,109] 224
19.5b4 stale+activating+remapped [124,130,20] 124 [124,20,11] 124
17.4c6 stale+active+clean [224,216,95] 224 [224,216,95] 224
73.413 stale+activating+remapped [117,130,189] 117 [117,189,137] 117
2.431 stale+remapped+peering [5,180,142] 5 [180,142,40] 180
69.1dc stale+active+clean [62,36,54] 62 [62,36,54] 62
14.790 stale+active+clean [81,114,19] 81 [81,114,19] 81
2.78e stale+active+clean [224,143,124] 224 [224,143,124] 224
73.37a stale+active+clean [224,84,38] 224 [224,84,38] 224
17.42d stale+activating+remapped [220,130,25] 220 [220,25,137] 220
72.263 stale+active+clean [224,148,117] 224 [224,148,117] 224
67.40 stale+active+clean [62,170,71] 62 [62,170,71] 62
67.16d stale+remapped+peering [3,147,22] 3 [147,22,29] 147
20.3de stale+active+clean [224,103,126] 224 [224,103,126] 224
19.721 stale+remapped [3,34,179] 3 [34,179,128] 34
19.2f1 stale+activating+remapped [126,130,178] 126 [126,178,72] 126
74.28b stale+active+clean [224,95,56] 224 [224,95,56] 224
20.6b6 stale+active+clean [224,56,126] 224 [224,56,126] 224
2.2ac stale+active+clean [224,223,143] 224 [224,223,143] 224
73.11c stale+activating+remapped [91,130,201] 91 [91,201,137] 91
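One pattern in the table above: many of the stale PGs share the same acting primary (osd.224 shows up repeatedly, and osd.62 more than once). A small parsing sketch makes the grouping explicit; it inlines a few rows copied verbatim from the dump_stuck output and assumes the whitespace-separated column layout shown (PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY):

```python
from collections import Counter

# A few rows copied from the `ceph pg dump_stuck stale` output above.
dump_stuck = """\
2.5c5 stale+active+clean [224,48,179] 224 [224,48,179] 224
17.64e stale+active+clean [224,84,109] 224 [224,84,109] 224
69.1dc stale+active+clean [62,36,54] 62 [62,36,54] 62
2.78e stale+active+clean [224,143,124] 224 [224,143,124] 224
67.40 stale+active+clean [62,170,71] 62 [62,170,71] 62
"""

# ACTING_PRIMARY is the last column of each row.
primaries = Counter(line.split()[-1] for line in dump_stuck.splitlines())
for osd, count in primaries.most_common():
    print(f"osd.{osd}: {count} stale pgs")
```

Running it over the full dump would show whether the stale PGs cluster on a handful of OSDs (e.g. ones on the dead host).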
## Queries on the PGs don't seem to work
##################################
$ sudo ceph pg 2.5c5 query
Error ENOENT: i don't have pgid 2.5c5
$ sudo ceph pg 17.6d7 query
Error ENOENT: i don't have pgid 17.6d7
## Ceph versions (in case that helps)
##############################
$ sudo ceph versions
{
    "mon": {
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
    },
    "osd": {
        "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)": 60,
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 158
    },
    "mds": {},
    "overall": {
        "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)": 60,
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 164
    }
}
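The version split may well be relevant: 60 of the 218 up OSDs are still on Jewel (10.2.9) while the mons and mgrs are fully on Luminous. A quick tally from the JSON above (assuming only the `ceph versions` output shape shown, abbreviated to the OSD section):

```python
import json

# The `ceph versions` JSON from above, trimmed to the OSD section.
versions = json.loads("""
{
  "osd": {
    "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)": 60,
    "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 158
  }
}
""")

osd_versions = versions["osd"]
total = sum(osd_versions.values())  # 60 + 158 = 218, matching "218 up"
for ver, count in sorted(osd_versions.items()):
    release = ver.split()[2]  # e.g. "10.2.9"
    print(f"{release}: {count}/{total} OSDs")
```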
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com