Re: disk mishap + bad disk and xfs corruption = stuck PG's

I resolved this situation, and the results are a testament to how amazing Ceph is.

I had the replication factor set to 2 during my data migration to ensure there was enough capacity on the target as I cycled disks from Drobo storage to Ceph storage.  I forgot about that setting.  One disk started showing mechanical and clearly terminal failures, so I began evacuating objects from it via a crush reweight on that osd.  Objects started to fly across osd's.  A bunch of data was moved and some was left to be moved.  Impatient me started tinkering on some other part of the cluster and accidentally "zapped" a second disk.  Things were still "ok".  Then the failing disk finally died and dropped out of the filesystem.  I rebooted the node, the xfs filesystem corrupted, and xfs_repair wiped out inodes and put everything into lost+found.  This was a BIG uh-oh.  Ceph reacted badly: 46 pg's ended up stuck and stale and the filesystem halted.  About 800,000 objects were handled by those pg's.  Mother eff!
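For anyone curious, the drain itself is just a crush reweight.  This is a rough sketch from memory, and the osd id is illustrative, not the real one:

  # weight the failing osd down to 0 so crush moves its objects elsewhere
  ceph osd crush reweight osd.7 0.0
  # watch recovery/backfill shuffle the objects off the disk
  ceph -w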

I ended up dropping the two disks completely from the cluster and dropping the pg's that were managing those objects.  As for the filesystem, I simply deleted it.
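Roughly the removal steps, sketched from memory (the osd ids and fs name here are illustrative):

  # remove the dead osd's from the cluster (repeat for the second disk)
  ceph osd out osd.2
  ceph osd crush remove osd.2
  ceph auth del osd.2
  ceph osd rm osd.2
  # with the MDS stopped, drop the filesystem
  ceph fs rm cephfs --yes-i-really-mean-it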

Dum dum duuuuum!  On a traditional storage system these acts would've been catastrophic to the "volume"!

But not for Ceph.

I re-added the good disk to the cluster and added a replacement disk.  I recreated the stuck pg's as blank pg's, which got the Ceph cluster healthy.  For the filesystem I rescanned the data pool to rebuild the object-to-inode map and recreated the inodes.  That made for a healthy posix filesystem, which I mounted back up.
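Sketching that part from memory too (the pg id and pool name below are placeholders, not my real ones):

  # recreate each stuck pg as an empty pg
  # (newer releases call this 'ceph osd force-create-pg')
  ceph pg force_create_pg 1.2f
  # rebuild the cephfs metadata by scanning the data pool
  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes cephfs_data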

Total loss was a few thousand minor fringe files.  My video surveillance archive is 100% intact.

Thanks!
/Chris Callegari

ps... post mortem actions: my pool size got set to 3 since I now have the raw capacity to do so.  ;-)
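(The change itself is a one-liner; the pool name below is a placeholder.)

  ceph osd pool set cephfs_data size 3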


On Fri, Jun 9, 2017 at 5:11 PM, Mazzystr <mazzystr@xxxxxxxxx> wrote:
Well, I did something bad; I just don't know how bad yet.  Before we get into it: my critical data is backed up to CrashPlan.  I'd rather not lose all my archive data, but losing some of it is ok.

I added a bunch of disks to my ceph cluster, so I turned off the cluster and dd'd the raw disks around so that the disks and osd's were ordered by id on the HBA.  I fat-fingered one disk and overwrote it.  Another disk didn't dd correctly... it seems to have not unmounted cleanly, plus it has some failures according to smartctl.  An xfs_repair run put a whole bunch of data into lost+found.

I brought the cluster up and let it settle down.  The result is 49 stuck pg's and CephFS is halted.

ceph -s is here
ceph osd tree is here
ceph pg dump minus the active pg's is here

OSD-2 is gone with no chance to restore it.

OSD-3 had the xfs corruption.  I have a bunch of /var/lib/ceph/osd/ceph-3/lost+found/blah/DIR_[0-9]+/blah.blah__head_blah.blah files after xfs_repair.  I looped these files through ceph osd map <pool> $file, and it seems they have all been replicated to other OSD's, so it looks safe to delete this data.
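Roughly the loop I ran, give or take (the pool name and the filename-to-object munging here are approximations, not my exact script):

  # map each recovered object to the osd's currently serving it
  POOL=cephfs_data   # placeholder pool name
  for f in /var/lib/ceph/osd/ceph-3/lost+found/*/DIR_*/*__head_*; do
      obj=$(basename "$f" | sed 's/__head_.*//')
      ceph osd map "$POOL" "$obj"
  done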

There are files named [0-9]+ in the top level of /var/lib/ceph/osd/ceph-3/lost+found.  I don't know what to do with these files.


I have a couple questions:
1) can the top level lost+found files be used to recreate the stuck pg's?

2a) can the pg's be dropped and recreated to bring the cluster to a healthy state?
2b) if i do this can CephFS be restored with just partial data loss?  The cephfs documentation isn't quite clear on how to do this.

Thanks for your time and help!
/Chris



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
