Re: osd failing to start

Martin Wilderoth <martin.wilderoth@xxxxxxxxxx> · Thu, 14 Jul 2016 12:09:12 +0200

Hello,
I can't find any hardware problems. I have run disk checks and looked through the log files.

Should the OSD fail with a core dump if there are hardware problems?

All my data seems intact; I only have:
HEALTH_ERR 915 pgs are stuck inactive for more than 300 seconds; 915 pgs down; 915 pgs peering; 915 pgs stuck inactive;
I guess it's due to the failing OSD.
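
For anyone who wants to compare that status line over time, here is a minimal Python sketch that splits it into a status word and per-condition PG counts. The function name `parse_health` and the regular expression are my own and are modelled only on the single line quoted above, so other Ceph releases may need a different pattern.

```python
import re

def parse_health(summary):
    """Split a health summary line into (status, {condition: pg_count})."""
    # The first word is the status; the rest is a ';'-separated list of conditions.
    status, _, detail = summary.strip().partition(" ")
    counts = {}
    for item in detail.split(";"):
        # Matches e.g. "915 pgs down" or "915 pgs are stuck inactive for more than 300 seconds".
        m = re.match(
            r"\s*(\d+) pgs (?:are )?(.+?)(?: for more than \d+ seconds)?\s*$", item
        )
        if m:
            counts[m.group(2)] = int(m.group(1))
    return status, counts

line = ("HEALTH_ERR 915 pgs are stuck inactive for more than 300 seconds; "
        "915 pgs down; 915 pgs peering; 915 pgs stuck inactive;")
status, counts = parse_health(line)
print(status, counts)
```

Every condition reports the same count, which would be consistent with one set of PGs being affected, although the status line alone does not prove that.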

I guess I could remove the OSD and add it back as a new one, but it's always interesting to know what's actually wrong.
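
For reference, the removal half of that could be sketched as below. This only builds the command strings and executes nothing. The helper name `osd_removal_commands` and the id `7` are placeholders, the `systemctl` unit name assumes a systemd-based host, and the sequence follows the usual manual removal procedure as I understand it rather than anything confirmed in this thread. With PGs still down, removing the OSD also discards whatever data it holds.

```python
def osd_removal_commands(osd_id):
    """Return the manual OSD removal commands as strings; nothing is executed."""
    name = "osd.{}".format(osd_id)
    return [
        "ceph osd out {}".format(osd_id),             # mark the OSD out of the cluster
        "systemctl stop ceph-osd@{}".format(osd_id),  # assumes a systemd-managed daemon
        "ceph osd crush remove {}".format(name),      # drop it from the CRUSH map
        "ceph auth del {}".format(name),              # delete its authentication key
        "ceph osd rm {}".format(osd_id),              # remove the OSD entry itself
    ]

commands = osd_removal_commands(7)  # 7 is a placeholder id, not the failing OSD
for command in commands:
    print(command)
```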

 /Regards Martin


On 14 July 2016 at 06:14, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Thu, Jul 14, 2016 at 06:06:58AM +0200, Martin Wilderoth wrote:

> Hello,
>
> I have a ceph cluster where one OSD is failing to start. I have been
> upgrading ceph to see if the error disappeared. Now I'm running jewel but I
> still get the error message.
>
>     -1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map is_healthy
> 'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150

This appears to indicate that an OSD thread pool thread (work queue thread)
has failed to complete an operation within the 150 second grace period.

The most likely and common cause for this is hardware failure and I would
therefore suggest you thoroughly check this device and look for indicators in
syslog, dmesg, diagnostics, etc. that this device may have failed.

--
HTH,
Brad
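
To see how often the suicide timeout quoted above shows up and which threads it names, a small Python sketch like the following can scan the OSD log. The pattern is modelled only on the one heartbeat_map line in this thread, and the names `TIMEOUT_RE` and `find_suicide_timeouts` are my own, so treat it as a starting point rather than a complete parser.

```python
import re

# Pattern modelled on the single heartbeat_map line quoted in this thread;
# other Ceph releases may word the message differently.
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy\s+'(?P<thread>[^']+)'\s+"
    r"had suicide timed out after (?P<seconds>\d+)"
)

def find_suicide_timeouts(lines):
    """Return (thread, seconds) for every suicide-timeout line found."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append((m.group("thread"), int(m.group("seconds"))))
    return hits

# The log line from the original report, joined back onto one line.
sample = ("    -1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map is_healthy "
          "'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150")
hits = find_suicide_timeouts([sample])
print(hits)
```

To run it against a real log, pass an open file object instead of the one-element list, since iterating over a file yields its lines.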

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com