Hi,
This is a cluster running 17.2.7, upgraded from 16.2.6 on 15 January 2024.
On Monday 22 January we had 4 HDDs, all on different servers, with I/O errors
because of some damaged sectors. The OSDs are hybrid, so the DB is on an SSD;
5 HDDs share 1 SSD.
I set the OSDs out with “ceph osd out 223 269 290 318”, and all hell broke
loose.
It took only minutes before the users complained about Ceph not working.
Ceph status reported slow ops on the OSDs that were set out, and running “ceph
tell osd.<id> dump_ops_in_flight” against the out OSDs just hung; after 30
minutes I stopped the dump command.
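(For reference, this is roughly how I could re-run the dump against the four
out OSDs without it hanging forever; the 30-second timeout is just an
arbitrary value I picked for illustration.)

# Try dump_ops_in_flight on each out OSD, but give up after 30s per OSD
# instead of letting the command hang indefinitely.
for id in 223 269 290 318; do
    echo "=== osd.$id ==="
    timeout 30 ceph tell osd.$id dump_ops_in_flight || echo "osd.$id: no reply within 30s"
done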
Long story short, I ended up running “ceph osd set nobackfill” until the slow
ops were gone, and then unsetting it when the slow ops message disappeared.
I needed to keep doing that so the cluster didn’t come to a halt, so I used
this one-liner loop:
“while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" && (date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill); sleep 10; done”
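(The same loop spread over several lines and commented, just to make it easier
to read:)

# Watchdog: whenever "ceph -s" reports an op blocked for 10+ seconds
# ("oldest one blocked for NN"), pause backfill for a moment so client
# I/O can get through, then re-enable it.
while true; do
    if ceph -s | grep -qE "oldest one blocked for [0-9]{2,}"; then
        date                          # note when we had to intervene
        ceph osd set nobackfill       # pause backfill
        sleep 15
        ceph osd unset nobackfill     # resume backfill
    fi
    sleep 10
done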
But now, 4 days later, the backfilling has stopped progressing completely and
the number of misplaced objects is increasing.
Some PGs have 0 misplaced objects but are still in the backfilling state, and
have been like that for over 24 hours now.
I have a hunch that it’s because PG 404.6e7 is in the state
“active+recovering+degraded+remapped”; it’s been in this state for over
48 hours.
It possibly has 2 missing objects, but since they are not unfound I can’t
delete them with “ceph pg 404.6e7 mark_unfound_lost delete”.
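(If it helps the diagnosis, I can also pull what each peer of the PG reports
as missing, with something along these lines; the jq paths are from memory,
so they may need adjusting:)

# Show how many objects each peer of PG 404.6e7 reports as missing,
# based on the peer_info section of "ceph pg query".
ceph pg 404.6e7 query | jq '.peer_info[] | {peer: .peer, missing: .stats.stat_sum.num_objects_missing}'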
Could someone please help me solve this?
Below is some output of ceph commands; I’ll also attach them.
ceph status (only removed the warnings about scrub and deep_scrub not running)
---
  cluster:
    id:     b321e76e-da3a-11eb-b75c-4f948441dcd0
    health: HEALTH_WARN
            Degraded data redundancy: 2/6294904971 objects degraded (0.000%), 1 pg degraded

  services:
    mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d)
    mgr: ceph-mon-1.ptrsea(active, since 11d), standbys: ceph-mon-2.mfdanx
    mds: 1/1 daemons up, 1 standby
    osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped pgs
    rgw: 7 daemons active (7 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 3945 pgs
    objects: 1.14G objects, 1.1 PiB
    usage:   1.8 PiB used, 1.2 PiB / 3.0 PiB avail
    pgs:     2/6294904971 objects degraded (0.000%)
             2980455/6294904971 objects misplaced (0.047%)
             3901 active+clean
             22   active+clean+scrubbing+deep
             17   active+remapped+backfilling
             4    active+clean+scrubbing
             1    active+recovering+degraded+remapped

  io:
    client: 167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr
ceph health detail (only removed the warnings about scrub and deep_scrub not running)
---
HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded
    pg 404.6e7 is active+recovering+degraded+remapped, acting [223,274,243,290,286,283]
ceph pg 404.6e7 list_unfound
---
{
    "num_missing": 2,
    "num_unfound": 0,
    "objects": [],
    "state": "Active",
    "available_might_have_unfound": true,
    "might_have_unfound": [],
    "more": false
}
ceph pg 404.6e7 query | jq .recovery_state
---
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2024-01-26T09:08:41.918637+0000",
    "might_have_unfound": [
      {
        "osd": "243(2)",
        "status": "already probed"
      },
      {
        "osd": "274(1)",
        "status": "already probed"
      },
      {
        "osd": "275(0)",
        "status": "already probed"
      },
      {
        "osd": "283(5)",
        "status": "already probed"
      },
      {
        "osd": "286(4)",
        "status": "already probed"
      },
      {
        "osd": "290(3)",
        "status": "already probed"
      },
      {
        "osd": "335(3)",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "275(0)",
        "335(3)"
      ],
      "waiting_on_backfill": [],
      "last_backfill_started": "404:e76011a9:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.18_56463c71-286c-4399-8d5d-0c278b7c97fd:head",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    }
  },
  {
    "name": "Started",
    "enter_time": "2024-01-26T09:08:40.909151+0000"
  }
]
ceph pg ls recovering backfilling
---
PG  OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  STATE  SINCE  VERSION  REPORTED  UP  ACTING
404.bc   287986  0  0       0  512046716673  0  0  10091  0  active+recovering+remapped           2h   217988'1385478   217988:10897565   [193,297,279,276,136,197]p193  [223,297,269,276,136,197]p223
404.c4   288236  0  288236  0  511669837559  0  0  10063  0  active+remapped+backfilling          24h  217988'1378228   217988:11719855   [156,186,178,345,339,177]p156  [223,186,178,345,339,177]p223
404.12a  287544  0  0       0  512246100354  0  0  10009  0  active+remapped+backfilling          24h  217988'1392371   217988:13739524   [248,178,250,145,304,272]p248  [223,178,250,145,304,272]p223
404.1c1  287739  0  286969  0  511800674008  0  0  10047  0  active+remapped+backfilling          2d   217988'1402889   217988:10975174   [332,246,183,169,280,255]p332  [318,246,183,169,280,255]p318
404.258  287737  0  277111  0  510099501390  0  0  10077  0  active+remapped+backfilling          24h  217988'1451778   217988:12780104   [308,199,134,342,188,221]p308  [318,199,134,342,188,221]p318
404.269  287990  0  0       0  512343190608  0  0  10043  0  active+remapped+backfilling          24h  217988'1358446   217988:14020217   [275,205,283,247,211,292]p275  [223,205,283,247,211,292]p223
404.34e  287624  0  277899  0  510447074297  0  0  10002  0  active+remapped+backfilling          24h  217988'1392933   217988:12636557   [322,141,338,168,251,218]p322  [318,141,338,168,251,218]p318
404.39c  287844  0  286692  0  512947685682  0  0  10017  0  active+remapped+backfilling          2d   217988'1414697   217988:11004944   [288,188,131,299,295,181]p288  [318,188,131,299,295,181]p318
404.511  287589  0  0       0  512014863711  0  0  10057  0  active+remapped+backfilling          24h  217988'1368741   217988:11544729   [166,151,327,333,186,150]p166  [223,151,327,333,186,150]p223
404.5f1  288126  0  286621  0  510850256945  0  0  10071  0  active+remapped+backfilling          24h  217988'1365831   217988:10348125   [214,332,289,184,255,160]p214  [223,332,289,184,255,160]p223
404.62a  288035  0  0       0  511318662269  0  0  10014  0  active+remapped+backfilling          3h   217988'1358010   217988:12528704   [322,260,259,319,149,152]p322  [318,260,259,319,149,152]p318
404.63d  287372  0  286559  0  508783837699  0  0  10074  0  active+remapped+backfilling          24h  217988'1402174   217988:11685744   [303,307,186,350,161,267]p303  [318,307,186,350,161,267]p318
404.6e3  288110  0  0       0  509047569016  0  0  10049  0  active+remapped+backfilling          24h  217988'1368547   217988:12202278   [166,317,233,144,337,240]p166  [223,317,233,144,337,240]p223
404.6e7  287856  2  2       0  510383394904  0  0  10047  0  active+recovering+degraded+remapped  3h   217988'1356501   217988:13157749   [275,274,243,335,286,283]p275  [223,274,243,290,286,283]p223
404.7d2  287619  0  286026  0  510708533087  0  0  10093  0  active+remapped+backfilling          3d   217988'1397393   217988:12146656   [185,139,299,222,155,149]p185  [223,139,299,222,155,149]p223
412.119  711468  0  0       0  207473602580  0  0  10099  0  active+remapped+backfilling          24h  217988'21613330  217988:87589096   [352,207,292,314,230,262]p352  [318,207,292,314,230,262]p318
412.12f  711529  0  701279  0  208498170310  0  0  10033  0  active+remapped+backfilling          24h  217988'14873593  217988:86198113   [303,305,183,215,130,244]p303  [318,305,183,215,130,244]p318
412.1fb  713044  0  3166    0  207787641403  0  0  10097  0  active+remapped+backfilling          2d   217988'14893270  217988:105346132  [156,137,228,241,262,353]p156  [223,137,228,241,262,353]p223
ceph osd tree out
---
ID   CLASS  WEIGHT      TYPE NAME             STATUS  REWEIGHT  PRI-AFF
 -1         3112.43481  root default
-67          192.35847      host ceph-hd-001
269    hdd    12.82390          osd.269           up         0  1.00000
-49          192.35847      host ceph-hd-003
223    hdd    12.82390          osd.223           up         0  1.00000
-73          192.35847      host ceph-hd-011
290    hdd    12.82390          osd.290           up         0  1.00000
-79          192.35847      host ceph-hd-014
318    hdd    12.82390          osd.318           up         0  1.00000