Re: Double OSD failure (won't start) any recovery options?

With pool size=3 Ceph should still be able to recover from 2 failed
OSDs. It will, however, disallow client access to PGs that have only 1
copy until they are replicated at least min_size times. Such PGs are
not marked "active".
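A quick way to confirm this is to check the pool's min_size and look
for PGs stuck in undersized/degraded states, for example (where
<pool-name> is a placeholder for your pool):

    # minimum number of replicas required before client I/O is allowed
    ceph osd pool get <pool-name> min_size
    # shows which PGs are degraded/undersized and why they are not active
    ceph health detail
    ceph pg dump_stuck unclean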

As for the cause of your problems, it looks hardware-related. What
caused the OSDs to stop before they failed to come back up?
Do you have the write cache disabled on your journal devices and data
disks? You can check with "smartctl -g wcache /dev/sdX". If not, you
should definitely disable it on the journal devices. I would strongly
suggest disabling it on the data disks too, because many devices "lie"
about persisting data.
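Something along these lines should work (an untested sketch; adjust
the device names, and note that on some drives/controllers the setting
does not survive a reboot):

    # check the current write cache setting
    smartctl -g wcache /dev/sdX
    # disable the volatile write cache via smartctl ...
    smartctl -s wcache,off /dev/sdX
    # ... or via hdparm
    hdparm -W 0 /dev/sdX

If it does not persist across reboots, a udev rule or init script can
reapply it at boot.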

On 30.06.2016 08:08, XPC Design wrote:
> I've had two OSDs fail and I'm pretty sure they won't recover from this.
> I'm looking for help trying to get them back online, if possible...
> 
> terminate called after throwing an instance of
> 'ceph::buffer::malformed_input'
>   what():  buffer::malformed_input: bad checksum on pg_log_entry_t
> 
> - I'm having this problem (http://pastebin.com/raw/jBp6YgUp) when
> starting my OSD.
> - The source code related to this is here:
> https://github.com/badone/ceph/blob/master/src/osd/osd_types.cc#L3422-3433
> - The OSD logs are here: http://pastebin.com/raw/PWwA0ae6
> 
> It seems that my OSDs were corrupted (unknown why), while leaving no
> trace of problems in dmesg, SMART, or anything that xfs_repair could find.
> 
> These two OSDs make up 6 TB of my 40 TB array (triple replicated) and
> I'm pretty sure I can't recover from this. I will probably know in
> about 10 hours. Does anyone know anything I can try to repair my OSDs?
> 
> My notes on the situation:
> 
> - It can't find the superblock on the first start after a reboot; no
> idea why. It's there, I can see it, and it doesn't complain after that.
> - The two OSDs were bought at the same time and have similar serial
> numbers, but show no bad SMART stats or dmesg errors.
> - The host these were installed in had a funky BIOS that was only
> reporting half the RAM it had in it. It doesn't have ECC memory. I
> have since replaced the memory.
> - xfs_repair has been run on both OSDs; it found nothing, and the
> problem still persists.
> - I have been at HEALTH_OK every day, but overnight scrubbing has been
> uncovering problematic PGs I've had to repair -- every single night so
> far. This morning it went beyond my ability to repair.

-- 
Tomasz Kuzemko
tomasz.kuzemko@xxxxxxxxxxxx

