Re: How can we repair OSD leveldb?

> On 17 August 2016 at 23:54, Dan Jakubiec <dan.jakubiec@xxxxxxxxx> wrote:
> 
> 
> Hi Wido,
> 
> Thank you for the response:
> 
> > On Aug 17, 2016, at 16:25, Wido den Hollander <wido@xxxxxxxx> wrote:
> > 
> > 
> >> On 17 August 2016 at 17:44, Dan Jakubiec <dan.jakubiec@xxxxxxxxx> wrote:
> >> 
> >> 
> >> Hello, we have a Ceph cluster with 8 OSDs that recently lost power to all 8 machines.  We've managed to recover the XFS filesystems on 7 of the machines, but the OSD service is only starting on 1 of them.
> >> 
> >> The other 5 machines all have complaints similar to the following:
> >> 
> >> 	2016-08-17 09:32:15.549588 7fa2f4666800 -1 filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : Corruption: 6 missing files; e.g.: /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb
> >> 
> >> How can we repair the leveldb to allow the OSDs to start up?
> >> 
> > 
> > My first question would be: How did this happen?
> > 
> > What hardware are you using underneath? Is there a RAID controller which is not flushing properly? This should not happen during a power failure.
> > 
> 
> Each OSD drive is connected to an onboard hardware RAID controller and configured in RAID 0 mode as individual virtual disks.  The RAID controller is an LSI 3108.
> 

Was that controller in writeback mode without a BBU?

> I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine) did not survive the power outage.  
> 

As Christian already asked: was the FS mounted with nobarrier?
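
One way to check this on a running node is to read the mount options from /proc/mounts. The following is only a minimal sketch in Python: the helper names (mount_options, has_nobarrier) are made up for illustration, and the OSD path is taken from the error message quoted above, so adjust it for each OSD.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted.

    mounts_text is the content of /proc/mounts, where each line has the form:
    device mount_point fstype options dump pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point shadows an earlier one.
            found = fields[3].split(",")
    return found


def has_nobarrier(mounts_text, mount_point):
    """True if mount_point is mounted with the nobarrier option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "nobarrier" in opts


def main():
    # Assumed path, taken from the log line in this thread.
    osd_path = "/var/lib/ceph/osd/ceph-1"
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError as exc:
        print("could not read /proc/mounts:", exc)
        return
    print(osd_path, "options:", mount_options(text, osd_path))
    print("nobarrier set:", has_nobarrier(text, osd_path))


if __name__ == "__main__":
    main()
```

This only reports how the filesystem is mounted right now. It says nothing about the controller's cache mode, which has to be checked with the controller's own tools.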

> We did have some problems with the stock Ubuntu xfs_repair (3.1.9) segfaulting, which we eventually overcame by building a newer version of xfs_repair (4.7.0).  But it did finally repair clean.
> 

Not good. An xfs_repair should not be required after a power failure. A properly mounted journaling filesystem with a good controller underneath should mount and just replay its journal.

> We actually have some different errors on other OSDs.  A few of them are failing with "Missing map in load_pgs" errors.  But generally speaking it appears to be missing files of various types causing different kinds of failures.
> 

Missing files are not good, very bad actually. This should never happen and points to something which is not Ceph's fault: a controller in writeback mode, the nobarrier mount option, etc.

> I'm really nervous now about the OSD's inability to start with any inconsistencies and no repair utilities (that I can find).  Any advice on how to recover?
> 

I am afraid that you won't be able to recover from this. You are missing essential files from the OSDs. Without them they won't be able to start.

Maybe, maybe, maybe something will be able to reconstruct the leveldb of the other OSDs with data from the one surviving OSD, but that's a very big maybe.

Wido

> > I don't know the answer to your question, but lost files are not good.
> > 
> > You might find them in a lost+found directory if XFS repair worked?
> > 
> 
> Sadly this directory is empty.
> 
> -- Dan
> 
> > Wido
> > 
> >> Thanks,
> >> 
> >> -- Dan J
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>