Hi
I'm trying to mark one OSD as down so we can drain it and replace
it. The disk keeps reporting medium read errors, so it's bound to fail
sooner rather than later. When I run the command on the mon to mark the
OSD down, it doesn't actually happen. When the OSD service is stopped
instead, the OSD also gets marked out, and I'm thinking (but perhaps
incorrectly?) that it would be good to keep the OSD down+in, so we can
read from it for as long as possible. Why doesn't the OSD get marked
down, and stay down, when I tell it to?
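For reference, what I'm running on the mon is roughly the following
(osd.123 just stands in for the actual OSD id):

# ceph osd down osd.123

but the OSD never stays marked down in the osd map.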
Context: our cluster is in a somewhat suboptimal state (see below);
this is after one of our OSD nodes failed and took a week to bring back
up (long story). Because our OSDs are filling very unevenly, we have had
to keep reweighting them to stay below the 85% full threshold. Several
disks are starting to fail now (they're over 4 years old, so more
frequent failures are to be expected).
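For completeness, the reweighting I mention is adjusting the override
weight, e.g. something like the following (id and weight purely as
examples):

# ceph osd reweight osd.123 0.85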
I'm open to suggestions for getting us back to HEALTH_OK more quickly,
but I think we'll get there eventually anyway...
Cheers
/Simon
----
# ceph -s
  cluster:
    health: HEALTH_ERR
            1 clients failing to respond to cache pressure
            1/843763422 objects unfound (0.000%)
            noout flag(s) set
            14 scrub errors
            Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent
            Degraded data redundancy: 13795525/7095598195 objects degraded (0.194%), 13 pgs degraded, 12 pgs undersized
            70 pgs not deep-scrubbed in time
            65 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 11h)
    mgr: cephmon3(active, since 35h), standbys: cephmon1
    mds: 1/1 daemons up, 1 standby
    osd: 264 osds: 264 up (since 2m), 264 in (since 75m); 227 remapped pgs
         flags noout
    rgw: 8 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 3681 pgs
    objects: 843.76M objects, 1.2 PiB
    usage:   2.0 PiB used, 847 TiB / 2.8 PiB avail
    pgs:     13795525/7095598195 objects degraded (0.194%)
             54839263/7095598195 objects misplaced (0.773%)
             1/843763422 objects unfound (0.000%)
             3374 active+clean
             195  active+remapped+backfill_wait
             65   active+clean+scrubbing+deep
             20   active+remapped+backfilling
             11   active+clean+snaptrim
             10   active+undersized+degraded+remapped+backfill_wait
             2    active+undersized+degraded+remapped+backfilling
             2    active+clean+scrubbing
             1    active+recovery_unfound+degraded
             1    active+clean+inconsistent

  progress:
    Global Recovery Event (8h)
      [==========================..] (remaining: 2h)