Just another quick question: do you know if your RAID controller is disabling the local disks' write caches? I'm wondering how this corruption occurred, and whether it is specific to your hardware/software configuration or a general Ceph vulnerability to sudden power loss. Normally write barriers should protect against this sort of thing, but a hardware RAID controller may not be passing flushes all the way down to the disks. (A sketch of how those settings could be checked is appended below the quoted thread.)

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Nick Fisk
> Sent: 05 May 2015 07:46
> To: 'Yujian Peng'; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: xfs corruption, data disaster!
>
> This is probably similar to what you want to try to do, but also mark those
> failed OSDs as lost, as I don't think you will have much luck getting them
> back up and running.
>
> http://ceph.com/community/incomplete-pgs-oh-my/#more-6845
>
> The only other option would be if anyone knows a way to rebuild the leveldb
> by indexing the contents of the filestore, but I suspect it would amount to
> much the same thing.
>
> But please get a second opinion before doing anything.
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Yujian Peng
> > Sent: 05 May 2015 02:14
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: xfs corruption, data disaster!
> >
> > Emmanuel Florac <eflorac@...> writes:
> >
> > > Le Mon, 4 May 2015 07:00:32 +0000 (UTC) Yujian Peng
> > > <pengyujian5201314 <at> 126.com> wrote:
> > >
> > > > I'm encountering a data disaster. I have a Ceph cluster with 145 OSDs.
> > > > The data center had a power problem yesterday, and all of the Ceph
> > > > nodes went down. Now I find that 6 disks (XFS) in 4 nodes have
> > > > data corruption. Some disks cannot be mounted, and some disks
> > > > have I/O errors in syslog:
> > > >   mount: Structure needs cleaning
> > > >   xfs_log_force: error 5 returned
> > > > I tried to repair one with xfs_repair -L /dev/sdx1, but the
> > > > ceph-osd then reported a leveldb error:
> > > >   Error initializing leveldb: Corruption: checksum mismatch
> > > > I cannot start the 6 OSDs, and 22 PGs are down.
> > > > This is really a tragedy for me. Can you give me some ideas to
> > > > recover the XFS filesystems? Thanks very much!
> > >
> > > For XFS problems, ask the XFS ML: xfs <at> oss.sgi.com
> > >
> > > You didn't give enough details, by far. What version of kernel and
> > > distro are you running? If there were errors, please post extensive
> > > logs. If you have I/O errors on some disks, you probably MUST replace
> > > them before going any further.
> > >
> > > Why did you run xfs_repair -L? Did you try xfs_repair without
> > > options first? Were you running the very latest version of
> > > xfs_repair (3.2.2)?
> > >
> > The OS is Ubuntu 12.04.5 with kernel 3.13.0:
> >   uname -a
> >   Linux ceph19 3.13.0-32-generic #57~precise1-Ubuntu SMP Tue Jul 15 03:51:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> >   cat /etc/issue
> >   Ubuntu 12.04.5 LTS \n \l
> >   xfs_repair -V
> >   xfs_repair version 3.1.7
> > I've tried xfs_repair without options, but it showed me some errors, so I
> > used the -L option.
> > Thanks for your reply!
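For reference, a minimal sketch of how the disk write-cache and barrier settings in question could be checked on one of the affected nodes. The device name (/dev/sdb) and the OSD mount path are placeholders rather than details from the thread, and behind some hardware RAID controllers the physical disks are not visible to hdparm/sdparm at all, in which case the controller vendor's CLI has to be used instead:

  # Is the drive's volatile write cache enabled? (ATA drives)
  hdparm -W /dev/sdb

  # Same check via the SCSI WCE (Write Cache Enable) bit (SAS, or SATA via the sd driver)
  sdparm --get=WCE /dev/sdb

  # Are the OSD filesystems mounted with barriers/flushes enabled?
  # XFS enables barriers by default; a "nobarrier" option here would be a red flag.
  grep /var/lib/ceph/osd /proc/mounts

  # If the controller cache is not battery/flash backed, disabling the drive's
  # own cache is one option (verify against the controller documentation first):
  hdparm -W0 /dev/sdb

If hdparm/sdparm cannot see the drives, the equivalent setting is usually exposed in the RAID controller's management tool as a per-drive "disk cache" policy.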
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com