Greetings,
Problem: After removing three osds on a single host (out + crush remove + auth del + osd rm; the commands are sketched just below this summary), I have six pgs that, after 10 days of recovery, are stuck in active+undersized+degraded+remapped+backfilling.
- Hammer 0.94.6
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
- failure domain of host
- cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
- all 6 stuck pgs belong to .rgw.buckets
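(For reference, the removal procedure for each of the three osds was along these lines -- osd.N stands in for the actual ids:)

ceph osd out N                  # stop mapping data to the osd
ceph osd crush remove osd.N     # remove it from the crushmap
ceph auth del osd.N             # drop its cephx key
ceph osd rm N                   # remove it from the osdmap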
The relevant section of our crushmap:
rule .rgw.buckets {
        ruleset 1
        type erasure
        min_size 7
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 250
        step take default
        step chooseleaf indep 0 type host
        step emit
}
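(To sanity-check that this rule can still fill all 9 slots across hosts, it can be exercised offline with crushtool; a sketch, assuming ruleset 1 and num-rep 9 (k+m) per the rule above -- any output from --show-bad-mappings means crush failed to map all 9 shards for that input:)

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 9 --show-bad-mappings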
This isn't the first time we've lost a disk (not even the first time we've lost multiple disks on a host in a single event), so we're used to the extended recovery times and understand this is going to be A Thing until we can introduce SSD journals. This is, however, the first time we've had pgs not return to an active+clean state after a couple days. As far as I can tell, our cluster is no longer making progress on the backfill operations, and I'm looking for advice on how to get things moving again.
Here's a dump of the stuck pgs:
ceph pg dump_stuck
ok
pg_stat state up up_primary acting acting_primary
33.151d active+undersized+degraded+remapped+backfilling [424,546,273,167,471,631,155,38,47] 424 [424,546,273,167,471,631,155,38,2147483647] 424
33.6c1 active+undersized+degraded+remapped+backfilling [453,86,565,266,338,580,297,577,404] 453 [453,86,565,266,338,2147483647,297,577,404] 453
33.17b7 active+undersized+degraded+remapped+backfilling [399,432,437,541,547,219,229,104,47] 399 [399,432,437,541,547,219,229,104,2147483647] 399
33.150d active+undersized+degraded+remapped+backfilling [555,452,511,550,643,431,141,329,486] 555 [555,2147483647,511,550,643,431,141,329,486] 555
33.13a8 active+undersized+degraded+remapped+backfilling [507,317,276,617,565,28,471,200,382] 507 [507,2147483647,276,617,565,28,471,200,382] 507
33.4c1 active+undersized+degraded+remapped+backfilling [413,440,464,129,641,416,295,266,431] 413 [413,440,2147483647,129,641,416,295,266,431] 413
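(For what it's worth, a quick way to see whether these pgs are making any progress at all is to watch the degraded/misplaced counters in the pg stats; 33.6c1 here is just one of the six:)

while sleep 300; do date; ceph pg 33.6c1 query | grep -E 'num_objects_(degraded|misplaced)'; done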
Based on a review of previous postings about this issue, I initially suspected that CRUSH couldn't map the pg to a full set of OSDs (hence the MAX_INT placeholder, 2147483647, in the acting lists), so I increased set_choose_tries from 50 to 200, and then again to 250 just to see if it would do anything. These changes had no effect that I could discern.
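(The set_choose_tries change was applied by round-tripping the crushmap in the usual way; roughly:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit "step set_choose_tries ..." in the .rgw.buckets rule
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new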
I next reviewed the output of ceph pg <pgid> query, and I see something similar to the following for each of my stuck pgs:
{
    "state": "active+undersized+degraded+remapped+backfilling",
    "snap_trimq": "[]",
    "epoch": 25211,
    "up": [
        453,
        86,
        565,
        266,
        338,
        580,
        297,
        577,
        404
    ],
    "acting": [
        453,
        86,
        565,
        266,
        338,
        2147483647,
        297,
        577,
        404
    ],
    "backfill_targets": [
        "580(5)"
    ],
    "actingbackfill": [
        "86(1)",
        "266(3)",
        "297(6)",
        "338(4)",
        "404(8)",
        "453(0)",
        "565(2)",
        "577(7)",
        "580(5)"
    ]
"state": "active+undersized+degraded+remapped+backfilling",
"snap_trimq": "[]",
"epoch": 25211,
"up": [
453,
86,
565,
266,
338,
580,
297,
577,
404
],
"acting": [
453,
86,
565,
266,
338,
2147483647,
297,
577,
404
],
"backfill_targets": [
"580(5)"
],
"actingbackfill": [
"86(1)",
"266(3)",
"297(6)",
"338(4)",
"404(8)",
"453(0)",
"565(2)",
"577(7)",
"580(5)"
]
In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7). For the other five pgs, the situation is the same -- the backfill target is a valid OSD on node 7.
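(For anyone who wants to double-check which host a given OSD lives on, either of the following works; 580 is just the example here:)

ceph osd find 580     # reports the host/crush location of osd.580
ceph osd tree         # or scan the host buckets directly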
If I dig further into the 'query' output, I encounter the following:
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2016-07-24 18:52:51.653375",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [
"580(5)"
],
"waiting_on_backfill": [
"580(5)"
],
"last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
"backfill_info": {
"begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
"end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
"objects": [
{
"object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
"version": "5356'86333"
},
...
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2016-07-24 18:52:51.653375",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [
"580(5)"
],
"waiting_on_backfill": [
"580(5)"
],
"last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
"backfill_info": {
"begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
"end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
"objects": [
{
"object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
"version": "5356'86333"
},
...
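(Side note: pulling just the recovery_state out of the full query output is easier with a filter -- assuming jq is installed:)

ceph pg 33.6c1 query | jq '.recovery_state[0].recovery_progress'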
Trying to understand what was going on, I shut off client traffic to the cluster and set debug_osd 20 and debug_ms 1 on osd.580 (the injectargs invocation is sketched after the log). I see the following section repeated ad infinitum:
=== BEGIN LOG ===
2016-07-25 15:56:12.682241 7f262e8ed700 1 -- 10.54.10.27:6818/913781 <== osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0 (3953350617 0 0) 0x3a11d700 con 0x3656c420
2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 epoch 25226
2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map osd.453 10.54.10.28:7010/1375782 25226
2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] handle_message: pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] do_scan pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] scan_range from 981926c1/default.421929.15_MY_OBJECT
2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] got 0 items, next 981926c1/default.421929.15_MY_OBJECT
2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] []
2016-07-25 15:56:12.694129 7f26724d1700 1 -- 10.54.10.27:6818/913781 --> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 finish
=== END LOG ===
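(The debug levels above were bumped at runtime; roughly:)

ceph tell osd.580 injectargs '--debug_osd 20 --debug_ms 1'
# and back down to the defaults afterwards:
ceph tell osd.580 injectargs '--debug_osd 0/5 --debug_ms 0/5'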
I'm in the process of digging through the OSD code to understand what's going on here, but I figured I would reach out to the community in the hopes that someone could point me in the right direction. If anyone has seen this before and can recommend a course of action, I'm all ears. And if there's any other information I can provide, please let me know.
Many thanks to any who can lend a hand or teach a man to fish.
Brian Felton