Is your min_size at least 2? Is only one OSD affected? If yes, and if it is only the journal that is corrupt while the actual OSD store is intact (although now lagging behind in writes), and you do have healthy copies of its PGs elsewhere (hence the min_size requirement), you could resolve the situation by:

1) ensuring the OSD with the corrupt journal is stopped,
2) recreating the journal, and
3) starting the OSD again.

The OSD should then peer its PGs, bring them back on par with the other copies, and the cluster should return to a healthy state.

See http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/ for a more detailed walkthrough. It talks about a failed SSD holding journals, but the situation is the same for any kind of journal failure.

Now, you mentioned having set the weight to 0 in the meantime; I have no idea how that will affect the above procedure, so maybe you should wait for somebody else to comment on it.
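For reference, on a systemd-managed node the bare sequence would look roughly like the following. Take it as a sketch rather than a recipe: osd.12 is just a placeholder id, and the exact unit name, init system, and whether ceph-osd has to be run as the ceph user depend on your release.

  # keep the cluster from rebalancing while the OSD is down
  ceph osd set noout

  # make sure the affected OSD is really stopped
  systemctl stop ceph-osd@12

  # the old journal is unreadable, so there is nothing left to flush;
  # just create a fresh, empty journal for this OSD
  ceph-osd -i 12 --mkjournal

  # bring the OSD back up; it should peer and catch up from the healthy copies
  systemctl start ceph-osd@12

  # once the cluster reports HEALTH_OK again
  ceph osd unset noout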
Hope this helps a bit,
-K.

On 2017-06-05 15:32, Zigor Ozamiz wrote:
> Hi everyone,
>
> Due to two big beginner's mistakes while handling and recovering a hard
> disk, we have reached a situation in which the system tells us that the
> journal of an OSD is corrupt:
>
> 2017-05-30 17:59:21.318644 7fa90757a8c0  1 journal _open
> /dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV5281013C200MGN-part3 fd 20:
> 20480000000 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2017-05-30 17:59:21.322226 7fa90757a8c0 -1 journal Unable to read past
> sequence 3219747309 but header indicates the journal has committed up
> through 3219750285, journal is corrupt
> 2017-05-30 17:59:21.325946 7fa90757a8c0 -1 os/FileJournal.cc: In
> function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&,
> bool*)' thread 7fa90757a8c0 time 2017-05-30 17:59:21.322296
> os/FileJournal.cc: 1853: FAILED assert(0)
>
> We think the only way we can reuse the OSD is to wipe it and start over.
> But before doing that, we lowered its weight to 0 and waited for the
> cluster to recover. Several days have passed since then, but some PGs
> are still in the "stale+active+clean" state:
>
> pg_stat  state               up   up_primary  acting  acting_primary
> 1.b5     stale+active+clean  [0]  0           [0]     0
> 1.22     stale+active+clean  [0]  0           [0]     0
> 1.53     stale+active+clean  [0]  0           [0]     0
> 1.198    stale+active+clean  [0]  0           [0]     0
> 1.199    stale+active+clean  [0]  0           [0]     0
> 1.4e     stale+active+clean  [0]  0           [0]     0
> 1.4f     stale+active+clean  [0]  0           [0]     0
> 1.a7     stale+active+clean  [0]  0           [0]     0
> 1.1ef    stale+active+clean  [0]  0           [0]     0
> 1.160    stale+active+clean  [0]  0           [0]     0
> 18.4     stale+active+clean  [0]  0           [0]     0
> 1.15e    stale+active+clean  [0]  0           [0]     0
> 1.a1     stale+active+clean  [0]  0           [0]     0
> 1.18a    stale+active+clean  [0]  0           [0]     0
> 1.156    stale+active+clean  [0]  0           [0]     0
> 1.6b     stale+active+clean  [0]  0           [0]     0
> 1.c6     stale+active+clean  [0]  0           [0]     0
> 1.1b1    stale+active+clean  [0]  0           [0]     0
> 1.123    stale+active+clean  [0]  0           [0]     0
> 1.17a    stale+active+clean  [0]  0           [0]     0
> 1.bc     stale+active+clean  [0]  0           [0]     0
> 1.179    stale+active+clean  [0]  0           [0]     0
> 1.177    stale+active+clean  [0]  0           [0]     0
> 1.b8     stale+active+clean  [0]  0           [0]     0
> 1.2a     stale+active+clean  [0]  0           [0]     0
> 1.117    stale+active+clean  [0]  0           [0]     0
>
> When executing "ceph pg query PGID" or "ceph pg PGID list_missing", we
> get the error "Error ENOENT: I do not have pgid PGID".
>
> Given that we are using replication 3, there should be no data loss,
> right? How should we proceed to solve the problem? We have considered:
>
> - running "ceph osd lost OSDID", as recommended in a previous thread on
>   this list;
> - recreating the PGs by hand via "ceph pg force_create PGID";
> - wiping the OSD directly.
>
> Thanks in advance,
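Regarding the options quoted above: before doing anything that cannot be undone, it is probably worth double-checking where those stale PGs are actually mapped and what the pool's size/min_size really are. Something along these lines should show that (1.b5 is just one PG taken from your listing, and "rbd" is only an example pool name, substitute your own):

  # overall state and the reason the PGs are flagged
  ceph health detail
  ceph pg dump_stuck stale

  # where one of the stale PGs is currently mapped
  ceph pg map 1.b5

  # confirm the pool really runs with size 3 / min_size 2
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

If any of those PGs really map to only a single OSD, that is worth understanding before marking anything lost or force-creating PGs.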