Is your min_size at least 2? Is only one OSD affected? If yes, and if it is only the journal that is corrupt while the actual OSD store is intact (although now lagging behind in writes), and you do have healthy copies of its PGs elsewhere (hence the min_size requirement), you could resolve the situation by:

1) ensuring the OSD with the corrupt journal is stopped,
2) recreating the journal, and
3) starting the OSD again.

The OSD should then peer its PGs, bring them back on par with the other copies, and the cluster should return to a healthy state.

See http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/ for a more detailed walkthrough. It talks about a failed SSD holding journals, but the situation is the same for any kind of journal failure.

Now, you mentioned having set the weight to 0 in the meantime; I have no idea how that will affect the above procedure, so maybe you should wait for somebody else to comment on it.
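For reference, on a systemd-managed node the bare sequence would look roughly like the following. Take it as a sketch rather than a recipe: osd.12 is just a placeholder id, and the exact unit name, init system, and whether ceph-osd has to be run as the ceph user depend on your release.

  # keep the cluster from rebalancing while the OSD is down
  ceph osd set noout

  # make sure the affected OSD is really stopped
  systemctl stop ceph-osd@12

  # the old journal is unreadable, so there is nothing left to flush;
  # just create a fresh, empty journal for this OSD
  ceph-osd -i 12 --mkjournal

  # bring the OSD back up; it should peer and catch up from the healthy copies
  systemctl start ceph-osd@12

  # once the cluster reports HEALTH_OK again
  ceph osd unset noout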
Hope this helps a bit,
-K.

On 2017-06-05 15:32, Zigor Ozamiz wrote:
> Hi everyone,
>
> Due to two big beginner's mistakes while handling and recovering a hard
> disk, we have reached a situation in which the system tells us that the
> journal of an OSD is corrupt:
>
> 2017-05-30 17:59:21.318644 7fa90757a8c0  1 journal _open
> /dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV5281013C200MGN-part3 fd 20:
> 20480000000 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2017-05-30 17:59:21.322226 7fa90757a8c0 -1 journal Unable to read past
> sequence 3219747309 but header indicates the journal has committed up
> through 3219750285, journal is corrupt
> 2017-05-30 17:59:21.325946 7fa90757a8c0 -1 os/FileJournal.cc: In
> function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&,
> bool*)' thread 7fa90757a8c0 time 2017-05-30 17:59:21.322296
> os/FileJournal.cc: 1853: FAILED assert(0)
>
> We think the only way we can reuse the OSD is to wipe it and start over.
> But before doing that, we lowered its weight to 0 and waited for the
> cluster to recover. Several days have passed since then, but some PGs
> are still in the "stale+active+clean" state:
>
> pg_stat  state               up   up_primary  acting  acting_primary
> 1.b5     stale+active+clean  [0]  0           [0]     0
> 1.22     stale+active+clean  [0]  0           [0]     0
> 1.53     stale+active+clean  [0]  0           [0]     0
> 1.198    stale+active+clean  [0]  0           [0]     0
> 1.199    stale+active+clean  [0]  0           [0]     0
> 1.4e     stale+active+clean  [0]  0           [0]     0
> 1.4f     stale+active+clean  [0]  0           [0]     0
> 1.a7     stale+active+clean  [0]  0           [0]     0
> 1.1ef    stale+active+clean  [0]  0           [0]     0
> 1.160    stale+active+clean  [0]  0           [0]     0
> 18.4     stale+active+clean  [0]  0           [0]     0
> 1.15e    stale+active+clean  [0]  0           [0]     0
> 1.a1     stale+active+clean  [0]  0           [0]     0
> 1.18a    stale+active+clean  [0]  0           [0]     0
> 1.156    stale+active+clean  [0]  0           [0]     0
> 1.6b     stale+active+clean  [0]  0           [0]     0
> 1.c6     stale+active+clean  [0]  0           [0]     0
> 1.1b1    stale+active+clean  [0]  0           [0]     0
> 1.123    stale+active+clean  [0]  0           [0]     0
> 1.17a    stale+active+clean  [0]  0           [0]     0
> 1.bc     stale+active+clean  [0]  0           [0]     0
> 1.179    stale+active+clean  [0]  0           [0]     0
> 1.177    stale+active+clean  [0]  0           [0]     0
> 1.b8     stale+active+clean  [0]  0           [0]     0
> 1.2a     stale+active+clean  [0]  0           [0]     0
> 1.117    stale+active+clean  [0]  0           [0]     0
>
> When executing "ceph pg query PGID" or "ceph pg PGID list_missing", we
> get the error "Error ENOENT: I do not have pgid PGID".
>
> Given that we are using replication 3, there should be no data loss,
> right? How should we proceed to solve the problem? We have considered:
>
> - running "ceph osd lost OSDID", as recommended in a previous thread on
>   this list;
> - recreating the PGs by hand via "ceph pg force_create PGID";
> - wiping the OSD directly.
>
> Thanks in advance,
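Regarding the options quoted above: before doing anything that cannot be undone, it is probably worth double-checking where those stale PGs are actually mapped and what the pool's size/min_size really are. Something along these lines should show that (1.b5 is just one PG taken from your listing, and "rbd" is only an example pool name, substitute your own):

  # overall state and the reason the PGs are flagged
  ceph health detail
  ceph pg dump_stuck stale

  # where one of the stale PGs is currently mapped
  ceph pg map 1.b5

  # confirm the pool really runs with size 3 / min_size 2
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

If any of those PGs really map to only a single OSD, that is worth understanding before marking anything lost or force-creating PGs.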