Our Ceph cluster stopped responding to requests two weeks ago, and I have been trying to fix it ever since. After a semi-hard reboot, roughly 11 OSDs "failed" across two hosts, on a pool with size set to two. I was able to extract a copy of every PG that resided solely on the nonfunctional OSDs, but the cluster still refuses to let me read that data. I marked all of the "failed" OSDs as lost and ran ceph pg $pg mark_unfound_lost revert for every PG reporting unfound objects, but that didn't help either. ddrescue doesn't get me anywhere either, because Ceph never admits that it has lost data; it just blocks forever instead of returning a read error.
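For reference, the recovery steps so far looked roughly like this (the OSD ID, PG ID, and export path below are only placeholders, and the health-detail parsing is approximate):

    # Export each PG that lived only on the dead OSDs (with that OSD's daemon stopped)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 2.1a --op export --file /backup/2.1a.export

    # Mark every dead OSD as lost
    ceph osd lost 12 --yes-i-really-mean-it

    # Revert the unfound objects on every PG that reports them
    for pg in $(ceph health detail | awk '/unfound/ {print $2}'); do
        ceph pg $pg mark_unfound_lost revert
    done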
Is there any way to tell Ceph to cut its losses and just let me access my data again?
  cluster:
    id:     313be153-5e8a-4275-b3aa-caea1ce7bce2
    health: HEALTH_ERR
            noout,nobackfill,norebalance flag(s) set
            2720183/6369036 objects misplaced (42.709%)
            9/3184518 objects unfound (0.000%)
            39 scrub errors
            Reduced data availability: 131 pgs inactive, 16 pgs down, 114 pgs incomplete
            Possible data damage: 7 pgs recovery_unfound, 1 pg inconsistent, 7 pgs snaptrim_error
            Degraded data redundancy: 1710175/6369036 objects degraded (26.851%), 1069 pgs degraded, 1069 pgs undersized
            Degraded data redundancy (low space): 82 pgs backfill_toofull

  services:
    mon: 1 daemons, quorum waitaha
    mgr: waitaha(active)
    osd: 43 osds: 34 up, 34 in; 1786 remapped pgs
         flags noout,nobackfill,norebalance

  data:
    pools:   2 pools, 2048 pgs
    objects: 3.18 M objects, 8.4 TiB
    usage:   21 TiB used, 60 TiB / 82 TiB avail
    pgs:     0.049% pgs unknown
             6.348% pgs not active
             1710175/6369036 objects degraded (26.851%)
             2720183/6369036 objects misplaced (42.709%)
             9/3184518 objects unfound (0.000%)
             987 active+undersized+degraded+remapped+backfill_wait
             695 active+remapped+backfill_wait
             124 active+clean
             114 incomplete
             62  active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             20  active+remapped+backfill_wait+backfill_toofull
             16  down
             12  active+undersized+degraded+remapped+backfilling
             7   active+recovery_unfound+undersized+degraded+remapped
             7   active+clean+snaptrim_error
             2   active+remapped+backfilling
             1   unknown
             1   active+undersized+degraded+remapped+inconsistent+backfill_wait
Thanks,
Dylan