Hi everyone,

Due to two big beginner mistakes made while handling and recovering a hard disk, we have reached a situation in which the system tells us that the journal of one of our OSDs is corrupt:

2017-05-30 17:59:21.318644 7fa90757a8c0  1 journal _open /dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV5281013C200MGN-part3 fd 20: 20480000000 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-05-30 17:59:21.322226 7fa90757a8c0 -1 journal Unable to read past sequence 3219747309 but header indicates the journal has committed up through 3219750285, journal is corrupt
2017-05-30 17:59:21.325946 7fa90757a8c0 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7fa90757a8c0 time 2017-05-30 17:59:21.322296
os/FileJournal.cc: 1853: FAILED assert(0)

We think the only way to reuse the OSD is to wipe it and start it again. Before doing that, we lowered its weight to 0 and waited for the cluster to recover by itself. Several days have passed since then, but some PGs are still in the "stale+active+clean" state:

pg_stat  state               up   up_primary  acting  acting_primary
1.b5     stale+active+clean  [0]  0           [0]     0
1.22     stale+active+clean  [0]  0           [0]     0
1.53     stale+active+clean  [0]  0           [0]     0
1.198    stale+active+clean  [0]  0           [0]     0
1.199    stale+active+clean  [0]  0           [0]     0
1.4e     stale+active+clean  [0]  0           [0]     0
1.4f     stale+active+clean  [0]  0           [0]     0
1.a7     stale+active+clean  [0]  0           [0]     0
1.1ef    stale+active+clean  [0]  0           [0]     0
1.160    stale+active+clean  [0]  0           [0]     0
18.4     stale+active+clean  [0]  0           [0]     0
1.15e    stale+active+clean  [0]  0           [0]     0
1.a1     stale+active+clean  [0]  0           [0]     0
1.18a    stale+active+clean  [0]  0           [0]     0
1.156    stale+active+clean  [0]  0           [0]     0
1.6b     stale+active+clean  [0]  0           [0]     0
1.c6     stale+active+clean  [0]  0           [0]     0
1.1b1    stale+active+clean  [0]  0           [0]     0
1.123    stale+active+clean  [0]  0           [0]     0
1.17a    stale+active+clean  [0]  0           [0]     0
1.bc     stale+active+clean  [0]  0           [0]     0
1.179    stale+active+clean  [0]  0           [0]     0
1.177    stale+active+clean  [0]  0           [0]     0
1.b8     stale+active+clean  [0]  0           [0]     0
1.2a     stale+active+clean  [0]  0           [0]     0
1.117    stale+active+clean  [0]  0           [0]     0

When we run "ceph pg query PGID" or "ceph pg PGID list_missing", we get the error "Error ENOENT: I do not have pgid PGID".

Given that we are using replication 3, there should be no data loss, right? How could we proceed to solve the problem? The options we are considering are:

- Running "ceph osd lost OSDID", as recommended in earlier threads on this list.
- Recreating the PGs by hand with "ceph pg force_create_pg PGID".
- Wiping the OSD directly.

Thanks in advance,

--
Zigor Ozamiz

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
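
For reference, here is a minimal, untested sketch of the options listed in the post above, at the command level. It assumes the broken OSD is osd.0 (the only member of the up/acting sets in the PG listing), a pre-Luminous cluster where "ceph pg force_create_pg" still exists, and systemd-managed OSD daemons; the PG ids shown are only the first few from the listing, and every value should be verified against the real cluster before anything is run.

#!/usr/bin/env bash
# Sketch of the recovery paths discussed above, not a tested procedure.
# Assumptions: the OSD with the corrupt journal is osd.0, the cluster is
# pre-Luminous ("ceph pg force_create_pg" still exists), and the OSD
# daemons are managed by systemd.
set -u

OSD_ID=0                                 # assumed id of the broken OSD

inspect() {
    # Re-check the current state before changing anything.
    ceph health detail                   # shows which PGs are stuck and why
    ceph pg dump_stuck stale             # should match the listing above
    ceph osd tree                        # confirm the OSD is down/out
}

mark_lost_and_recreate() {
    # Path 1: declare the OSD's data gone, then recreate the PGs that have
    # no surviving copy. Only acceptable if losing their contents is
    # acceptable.
    ceph osd lost "$OSD_ID" --yes-i-really-mean-it
    for pg in 1.b5 1.22 1.53; do         # first few stale PGs as examples
        ceph pg force_create_pg "$pg"
    done
}

wipe_and_readd() {
    # Path 2: remove the OSD from the cluster, then re-provision the disk
    # and let backfill repopulate it.
    ceph osd out "$OSD_ID"
    systemctl stop "ceph-osd@${OSD_ID}"  # assumption: systemd-managed OSD
    ceph osd crush remove "osd.${OSD_ID}"
    ceph auth del "osd.${OSD_ID}"
    ceph osd rm "$OSD_ID"
    # ...then re-create the OSD (e.g. with ceph-disk / ceph-deploy).
}

# The two paths are alternatives; inspect first, then run only one of them.
inspect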