linux-stable/Documentation/oops-tracing.txt:
>  8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.
> 15: 'L' if a soft lockup has previously occurred on the system.

Your first log entry already has D and L in its taint flags... can you
try to get the first oops, from before D or L were flagged? Without
that, your log only shows what no longer works as a result of the
problem, but not necessarily the problem itself.

To capture the full log of a dying/dead system, you need to set up
another way of logging than the local disk (a dead kernel will not
write to its persistent storage, for fear of destroying its
integrity). So you need something like a network logger or a serial
console logger.

For network logging there is a kernel cmdline method, which is
horribly documented; I have never managed to get it to work and do not
recommend it... you only really need that method when the machine
won't boot, and even then a serial console ought to work. The other
network options include things like configuring syslog to send the log
over the network. I think it's probably also possible to simply run a
long-running "sudo cat /dev/kmsg | nc ..." command to keep reading the
kernel log and sending it over the network.

Peter

On 01/23/17 17:37, Matthew Vernon wrote:
> Hi,
>
> We have a 9-node ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu
> Xenial). We're seeing both machines freezing (nothing in the logs on
> the machine, which is entirely unresponsive to anything except the
> power button) and machines suffering soft lockups.
>
> Has anyone seen anything similar? Googling hasn't found anything
> obvious, and while ceph repairs itself when a machine is lost, this
> is obviously quite concerning.
>
> I don't have any useful logs from the machines that freeze, but I do
> have logs from the machine that suffered soft lockups - you can see
> the relevant bits of kern.log here:
>
> https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing
>
> [available compressed and uncompressed]
>
> The cluster was installed with ceph-ansible, and the specs of each
> node are roughly:
>
> Cores:   16 (2x 8-core Intel E5-2690)
> Memory:  512 GB (16x 32 GB)
> Storage: 2x 120 GB Samsung SSD (system disk)
>          2x 2 TB NVMe cards (ceph journal)
>          60x 6 TB Toshiba 7200 rpm disks (ceph storage)
> Network: 1 Gbit/s Intel I350 (control interface)
>          2x 100 Gbit/s Mellanox cards (bonded together)
>
> We're in pre-production testing, but any suggestions on how we might
> get to the bottom of this would be appreciated!
>
> There's no obvious pattern to these problems, and we've had 2 freezes
> and 1 soft lockup in the last ~1.5 weeks.
>
> Thanks,
>
> Matthew
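
P.S. Some sketches of what I mean, in case they help. The cmdline
method is netconsole; the documented form is below (taken from
Documentation/networking/netconsole.txt -- I never got it working
myself, so treat it as a starting point only; the addresses, MAC and
ports here are made-up placeholders):

    # on the dying machine, appended to the kernel cmdline
    # (format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac):
    netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55

    # on the receiving machine, listen for the UDP packets:
    netcat -u -l -p 6666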
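For the syslog route, Ubuntu's rsyslog needs roughly one line on each
node (again a sketch: "loghost" is a placeholder, and the receiving
side must have its TCP input enabled):

    # /etc/rsyslog.d/60-remote.conf on each ceph node
    # (@@ forwards over TCP; a single @ would use UDP)
    *.* @@loghost:514

    # on loghost, enable a TCP listener in the rsyslog config:
    module(load="imtcp")
    input(type="imtcp" port="514")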
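And the /dev/kmsg idea spelled out (also a sketch; host name, port and
file name are placeholders, and the exact nc flags depend on which
netcat variant you have installed):

    # on loghost, keep a listener running and append to a file:
    nc -k -l 5555 >> node1-kmsg.log

    # on the ceph node, stream the kernel ring buffer over the network:
    sudo cat /dev/kmsg | nc loghost 5555

Opening /dev/kmsg replays the whole ring buffer and then blocks
waiting for new messages, so this keeps sending whatever the kernel
still manages to print, right up until the network stack itself dies.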