//Tu
On Mon, Jan 23, 2017 at 8:38 AM Matthew Vernon <mv3@xxxxxxxxxxxx> wrote:
Hi,
We have a 9-node Ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu
Xenial). We're seeing two kinds of failure: machines freezing outright
(nothing in the logs on the machine, which is entirely unresponsive to
anything except the power button) and machines suffering soft lockups.
Has anyone seen anything similar? Googling hasn't found anything obvious,
and while Ceph repairs itself when a machine is lost, this is obviously
quite concerning.
I don't have any useful logs from the machines that freeze, but I do
have logs from the machine that suffered soft lockups - you can see the
relevant bits of kern.log here:
https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing
[available compressed and uncompressed]
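For anyone triaging similar kern.log output, here is a minimal sketch of
how the soft-lockup reports could be pulled out and counted per process
and CPU. It assumes the kernel watchdog wording
"soft lockup - CPU#N stuck for Ns! [comm:pid]", which may differ between
kernel versions; the function names and the default path
/var/log/kern.log are illustrative choices, not anything taken from the
logs linked above.

```python
import re
from collections import Counter

# Assumed watchdog message shape: "... soft lockup - CPU#<n> stuck for <s>s! [<comm>:<pid>]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in an iterable of log lines."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append({
                "cpu": int(m.group("cpu")),
                "secs": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
                "line": line.rstrip("\n"),
            })
    return hits


def summarize(hits):
    """Count reports per (process name, CPU) pair."""
    return Counter((h["comm"], h["cpu"]) for h in hits)


def report(path="/var/log/kern.log"):
    """Print a per-process, per-CPU count of soft lockups in the given log file."""
    # errors="replace" keeps the scan going if the log contains stray bytes.
    with open(path, errors="replace") as f:
        hits = find_soft_lockups(f)
    for (comm, cpu), n in summarize(hits).most_common():
        print(f"{n:4d}  {comm}  CPU#{cpu}")
```

Calling report() on each node's kern.log gives a quick view of whether
the lockups cluster on one process or one CPU, which can help narrow down
where to look next.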
The cluster was installed with ceph-ansible, and the specs of each node
are roughly:
Cores:   16 (2 x 8-core Intel E5-2690)
Memory:  512 GB (16 x 32 GB)
Storage: 2 x 120 GB Samsung SSD (system disk)
         2 x 2 TB NVMe cards (Ceph journal)
         60 x 6 TB Toshiba 7200 rpm disks (Ceph storage)
Network: 1 Gbit/s Intel I350 (control interface)
         2 x 100 Gbit/s Mellanox cards (bonded together)
We're in pre-production testing, but any suggestions on how we might get
to the bottom of this would be appreciated!
There's no obvious pattern to these problems, and we've had 2 freezes
and 1 soft lockup in the last ~1.5 weeks.
Thanks,
Matthew
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com