linux-stable/Documentation/oops-tracing.txt:
>  8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.
> 15: 'L' if a soft lockup has previously occurred on the system.

Your first log entry already has D and L in its taint flags... can you
try to get the first oops, from before D or L were flagged? Without
that, your log only shows what no longer works as a result of the
problem, but not necessarily the problem itself.

To capture the full log of a dying/dead system, you need to set up
another way of logging than the local disk (a dead kernel will not
write to its persistent storage, for fear of destroying its
integrity). So you need something like a network logger or a serial
console logger.

For network logging there is a kernel cmdline method, which is
horribly documented; I have never managed to get it to work and do not
recommend it... you only really need that method when the machine
won't boot, and even then a serial console ought to work. The other
network options include things like configuring syslog to send the log
over the network. I think it's probably also possible to simply run a
long-running "sudo cat /dev/kmsg | nc ..." command to keep reading the
kernel log and sending it over the network.

Peter

On 01/23/17 17:37, Matthew Vernon wrote:
> Hi,
>
> We have a 9-node ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu
> Xenial). We're seeing both machines freezing (nothing in the logs on
> the machine, which is entirely unresponsive to anything except the
> power button) and machines suffering soft lockups.
>
> Has anyone seen anything similar? Googling hasn't found anything
> obvious, and while ceph repairs itself when a machine is lost, this
> is obviously quite concerning.
>
> I don't have any useful logs from the machines that freeze, but I do
> have logs from the machine that suffered soft lockups - you can see
> the relevant bits of kern.log here:
>
> https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing
>
> [available compressed and uncompressed]
>
> The cluster was installed with ceph-ansible, and the specs of each
> node are roughly:
>
> Cores:   16 (2x 8-core Intel E5-2690)
> Memory:  512 GB (16x 32 GB)
> Storage: 2x 120 GB Samsung SSD (system disk)
>          2x 2 TB NVMe cards (ceph journal)
>          60x 6 TB Toshiba 7200 rpm disks (ceph storage)
> Network: 1 Gbit/s Intel I350 (control interface)
>          2x 100 Gbit/s Mellanox cards (bonded together)
>
> We're in pre-production testing, but any suggestions on how we might
> get to the bottom of this would be appreciated!
>
> There's no obvious pattern to these problems, and we've had 2 freezes
> and 1 soft lockup in the last ~1.5 weeks.
>
> Thanks,
>
> Matthew
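
P.S. Some sketches of what I mean, in case they help. The cmdline
method is netconsole; the documented form is below (taken from
Documentation/networking/netconsole.txt -- I never got it working
myself, so treat it as a starting point only; the addresses, MAC and
ports here are made-up placeholders):

    # on the dying machine, appended to the kernel cmdline
    # (format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac):
    netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55

    # on the receiving machine, listen for the UDP packets:
    netcat -u -l -p 6666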
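For the syslog route, Ubuntu's rsyslog needs roughly one line on each
node (again a sketch: "loghost" is a placeholder, and the receiving
side must have its TCP input enabled):

    # /etc/rsyslog.d/60-remote.conf on each ceph node
    # (@@ forwards over TCP; a single @ would use UDP)
    *.* @@loghost:514

    # on loghost, enable a TCP listener in the rsyslog config:
    module(load="imtcp")
    input(type="imtcp" port="514")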
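And the /dev/kmsg idea spelled out (also a sketch; host name, port and
file name are placeholders, and the exact nc flags depend on which
netcat variant you have installed):

    # on loghost, keep a listener running and append to a file:
    nc -k -l 5555 >> node1-kmsg.log

    # on the ceph node, stream the kernel ring buffer over the network:
    sudo cat /dev/kmsg | nc loghost 5555

Opening /dev/kmsg replays the whole ring buffer and then blocks
waiting for new messages, so this keeps sending whatever the kernel
still manages to print, right up until the network stack itself dies.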