Re: Replace corrupt journal

"Sahlstrom, Claes" <csahlstrom@xxxxxxxx> · Mon, 12 Jan 2015 15:07:56 +0000

Thanks for the reply. I have now had some more time to experiment with this.

I understand that the best option is to wipe the OSD and let the cluster rebuild it, but I am currently using only one replica, and since 2 of my 3 machines had problems I ended up in a bad situation. With OSDs down on 2 machines and only one replica, I think I would certainly have lost data if I had rebuilt them from scratch. Luckily, in my case no new data was being written to the cluster at the time; I only use it as a NAS in my home lab.

It worked out fine for me this time, but anyone reading this should know that it is not a recommended way to do things. I got confused because I was reusing a logical volume as the journal and did not wipe it properly before running "--mkjournal". Wiping it properly and then running "--mkjournal" again seems to have solved the problem for me.
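
For anyone who ends up here anyway, the sequence I mean can be sketched roughly as below. This is only a sketch under assumptions: the OSD id, the logical volume path and the amount of data zeroed are placeholders I made up for illustration, and stopping the OSD beforehand is not shown. Any writes that only existed in the old journal are gone after this, which is why it is not the recommended route. The script only prints the commands unless DRY_RUN=0 is set, since zeroing the wrong device is destructive.

```shell
#!/bin/sh
# Sketch only: recreate an OSD journal on a reused logical volume.
# OSD_ID and JOURNAL_DEV are hypothetical placeholders, not values from this thread.
OSD_ID="${OSD_ID:-0}"
JOURNAL_DEV="${JOURNAL_DEV:-/dev/vg0/journal-osd0}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

recreate_journal() {
    # Zero the start of the old volume so stale journal data is not picked up.
    # count=100 is an assumption; zeroing the whole volume is more thorough.
    run dd if=/dev/zero of="$JOURNAL_DEV" bs=1M count=100 oflag=direct
    # Write a fresh, empty journal for this OSD.
    run ceph-osd -i "$OSD_ID" --mkjournal
}

recreate_journal
```

With DRY_RUN left at its default the script just echoes the dd and ceph-osd commands, so they can be checked against the right device before anything is run for real.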

My only outstanding issue now is one pg that remains inconsistent even after attempting a repair; apart from that, everything seems to be fine. I haven't dug into it much yet. With only one replica, I guess it is tricky to tell which of the copies is the broken one.
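
To find the affected pg I filter the health output, roughly as in the sketch below. The sample line is illustrative and not copied from my cluster, and the awk pattern assumes the "pg <id> is ...inconsistent..." layout that "ceph health detail" prints, so treat both as assumptions.

```shell
#!/bin/sh
# Sketch only: list PGs that "ceph health detail" reports as inconsistent.
inconsistent_pgs() {
    # Reads health-detail text on stdin, prints one PG id per line.
    awk '$1 == "pg" && /inconsistent/ { print $2 }'
}

# Illustrative input, not real output from this cluster:
sample='pg 0.6 is active+clean+inconsistent, acting [0,1]'
printf '%s\n' "$sample" | inconsistent_pgs

# On a live cluster this would be something like:
#   ceph health detail | inconsistent_pgs
#   ceph pg deep-scrub <pgid>   # re-check the PG
#   ceph pg repair <pgid>       # ask the cluster to repair it
```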

I will add a note to that ticket. It happened when the server lost power while replicating, and I think that is what corrupted the two journals.
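
In case it is useful for the ticket, the two sequence numbers in the log line quoted below can be pulled out with something like this sketch. The function name is my own invention, and it assumes the log message is on a single unwrapped line.

```shell
#!/bin/sh
# Sketch only: extract the journal sequence numbers from the FileJournal error.
journal_gap() {
    # Reads log text on stdin; prints "<last readable> <committed> <difference>".
    sed -n 's/.*Unable to read past sequence \([0-9][0-9]*\) but header indicates the journal has committed up through \([0-9][0-9]*\).*/\1 \2/p' |
    while read -r readable committed; do
        echo "$readable $committed $((committed - readable))"
    done
}

line='journal Unable to read past sequence 8188178 but header indicates the journal has committed up through 8188206, journal is corrupt'
printf '%s\n' "$line" | journal_gap
# prints: 8188178 8188206 28
```

So the header claims 28 more committed entries than could actually be read back.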

Cheers,
Claes

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx] 
Sent: den 12 januari 2015 15:46
To: Sahlstrom, Claes
Cc: ceph-users@xxxxxxxx
Subject: Re:  Replace corrupt journal

On Sun, 11 Jan 2015, Sahlstrom, Claes wrote:
> Hi,
>
> I have a problem starting a couple of OSDs because the journal is
> corrupt. Is there any way to replace the journal and keep the rest
> of the OSD intact?

It is risky at best... I would not recommend it!  The safe route is to wipe the OSD and let the cluster repair.

>     -1> 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to 
> read past sequence 8188178 but header indicates the journal has 
> committed up through 8188206, journal is corrupt
> 
>      0> 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: 
> In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)'
> thread 7fb32df86900 time 2015-01-11 16:02:54.475276
> 
> os/FileJournal.cc: 1693: FAILED assert(0)

Do you mind making a note that you saw this on this ticket:

	http://tracker.ceph.com/issues/6003

We see it periodically in QA but have never been able to track it down.  
It could also be caused by a hardware issue, so any information about whether the journal device appears damaged would be helpful.

Thanks!
sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com