ceph osd down doesn't seem to work




I'm trying to mark one OSD down so we can drain it and replace it. The disk keeps logging medium read errors, so it's bound to fail sooner rather than later. But when I tell ceph from the mon to mark the OSD down, it doesn't actually happen. When the service on the OSD host is stopped, the OSD is also marked out, and I'm thinking (perhaps incorrectly?) that it would be better to keep the OSD down+in, so it can still be read from for as long as possible. Why doesn't the OSD get marked down and stay that way when I command it?
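For reference, this is roughly what I'm doing (osd.17 stands in for the failing OSD's actual id):

```shell
# On the mon node: mark the failing OSD down (osd.17 is a placeholder id)
ceph osd down osd.17

# Moments later it is reported up again:
ceph osd tree | grep -w osd.17
```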

Context: our cluster is in a somewhat degraded state (see below). This is the aftermath of one of the OSD nodes failing and taking a week to get back up (long story). Due to seriously unbalanced filling of our OSDs we have had to repeatedly reweight OSDs to stay below the 85% threshold. Several disks are now starting to fail (they're over 4 years old, so more frequent failures are to be expected).
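The reweighting mentioned above was done along these lines (the OSD id and weight are just examples):

```shell
# Show per-OSD utilization to find the overfull ones
ceph osd df tree

# Temporarily lower the override weight of an overfull OSD
# (osd.42 and 0.95 are example values; the weight is in the range 0.0-1.0)
ceph osd reweight osd.42 0.95
```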

I'm open to suggestions to help get us back to health_ok more quickly, but I think we'll get there eventually anyway...
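In case it helps with suggestions: the status below is from `ceph -s`, and the per-pg breakdown of the scrub errors and the unfound object can be pulled with:

```shell
ceph -s              # cluster summary, pasted below
ceph health detail   # lists the inconsistent/recovery_unfound pgs individually
```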




# ceph -s
    health: HEALTH_ERR
            1 clients failing to respond to cache pressure
            1/843763422 objects unfound (0.000%)
            noout flag(s) set
            14 scrub errors
            Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent
            Degraded data redundancy: 13795525/7095598195 objects degraded (0.194%), 13 pgs degraded, 12 pgs undersized
            70 pgs not deep-scrubbed in time
            65 pgs not scrubbed in time

    mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 11h)
    mgr: cephmon3(active, since 35h), standbys: cephmon1
    mds: 1/1 daemons up, 1 standby
    osd: 264 osds: 264 up (since 2m), 264 in (since 75m); 227 remapped pgs
         flags noout
    rgw: 8 daemons active (4 hosts, 1 zones)

    volumes: 1/1 healthy
    pools:   15 pools, 3681 pgs
    objects: 843.76M objects, 1.2 PiB
    usage:   2.0 PiB used, 847 TiB / 2.8 PiB avail
    pgs:     13795525/7095598195 objects degraded (0.194%)
             54839263/7095598195 objects misplaced (0.773%)
             1/843763422 objects unfound (0.000%)
             3374 active+clean
             195  active+remapped+backfill_wait
             65   active+clean+scrubbing+deep
             20   active+remapped+backfilling
             11   active+clean+snaptrim
             10   active+undersized+degraded+remapped+backfill_wait
             2    active+undersized+degraded+remapped+backfilling
             2    active+clean+scrubbing
             1    active+recovery_unfound+degraded
             1    active+clean+inconsistent

    Global Recovery Event (8h)
      [==========================..] (remaining: 2h)
ceph-users mailing list -- ceph-users@xxxxxxx
