On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> Hi,
> This is completely off-list, but I'm asking because only ceph triggers
> such a bug :) .
>
> With 0.51, the following happens: if I kill an osd, one or more neighbor
> nodes may go into a hung state with cpu lockups, not related to
> temperature, overall interrupt count, or load average, and it happens
> randomly over the 16-node cluster. I'm almost sure that ceph is
> triggering some hardware bug, but I'm not quite sure of its origin.
> Also, a short time after resetting from such a crash, a new lockup may
> be triggered by any action.

From the log, it looks like your ethernet driver is crapping out.

[172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
...
[172517.058622] [<ffffffff812b2975>] ? netif_tx_lock+0x40/0x76

etc.

The later oopses are talking about paravirt_write_msr etc, which makes
me think you're using Xen? You probably don't want to run Ceph servers
inside virtualization (for production).

[172696.503900] [<ffffffff8100d025>] ? paravirt_write_msr+0xb/0xe
[172696.503942] [<ffffffff810325f3>] ? leave_mm+0x3e/0x3e

and *then* you get

[172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[172695.041745] megasas: [ 0]waiting for 35 commands to complete
[172696.045602] megaraid_sas: no pending cmds after reset
[172696.045644] megasas: reset successful

which just adds more awesomeness to the soup -- though I do wonder if
this could be caused by the soft hang from earlier.
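
If it helps with correlating the timestamps, here is a rough, untested
Python sketch that pulls just the suspect lines (NETDEV WATCHDOG,
paravirt_write_msr, megasas/megaraid_sas) out of a kernel log dump and
prints them in timestamp order, so you can see whether the controller
reset really does follow the earlier soft hang. The file name and the
signature list are only assumptions -- adjust them to your setup.

#!/usr/bin/env python3
# Sketch: extract suspect lines from a dmesg/syslog dump and sort them
# by the bracketed kernel timestamp. File name and signatures are
# assumptions, not anything from the original logs beyond what was quoted.

import re
import sys

SIGNATURES = ("NETDEV WATCHDOG", "paravirt_write_msr",
              "megasas", "megaraid_sas")
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

def main(path):
    hits = []
    with open(path) as f:
        for line in f:
            if any(sig in line for sig in SIGNATURES):
                m = TIMESTAMP.search(line)
                ts = float(m.group(1)) if m else float("inf")
                hits.append((ts, line.rstrip()))
    for _, line in sorted(hits):
        print(line)

if __name__ == "__main__":
    # e.g. python3 grep_lockups.py dmesg.txt
    main(sys.argv[1] if len(sys.argv) > 1 else "dmesg.txt")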