Hi,

This is completely off-topic for the list, but I'm asking because only Ceph triggers this bug :). With 0.51, the following happens: if I kill an OSD, one or more neighboring nodes may hang with CPU lockups. The lockups are not related to temperature, overall interrupt count, or load average, and they hit nodes at random across the 16-node cluster. I am almost sure that Ceph is triggering some hardware bug, but I am not sure of its origin. Also, for a short time after resetting a node following such a crash, any action may cause a new lockup.

Before blaming system drivers and investigating further, may I ask whether anyone has faced a similar problem? I am using 802.3ad bonding on a pair of Intel I350 interfaces for general connectivity. I have attached some traces that were sent over netconsole (in some cases the machine died hard, i.e. without even sending a final message over netconsole, so the traces are incomplete).
Attachment:
netcon.log.gz
Description: GNU Zip compressed data