Hi,

We have a 9-node Ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu Xenial). We're seeing two kinds of failure: machines freezing (nothing in the logs on the machine, which is entirely unresponsive to anything except the power button) and machines suffering soft lockups.

Has anyone seen similar? Googling hasn't found anything obvious, and while Ceph repairs itself when a machine is lost, this is obviously quite concerning.

I don't have any useful logs from the machines that froze, but I do have logs from the machine that suffered soft lockups. You can see the relevant bits of kern.log here (available compressed and uncompressed):

https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing

The cluster was installed with ceph-ansible, and the specs of each node are roughly:

Cores:   16 (2 x 8-core Intel E5-2690)
Memory:  512 GB (16 x 32 GB)
Storage: 2 x 120 GB Samsung SSD (system disk)
         2 x 2 TB NVMe cards (Ceph journal)
         60 x 6 TB Toshiba 7200 rpm disks (Ceph storage)
Network: 1 Gbit/s Intel I350 (control interface)
         2 x 100 Gbit/s Mellanox cards (bonded together)

We're in pre-production testing, but any suggestions on how we might get to the bottom of this would be appreciated! There's no obvious pattern to these problems, and we've had 2 freezes and 1 soft lockup in the last ~1.5 weeks.

Thanks,

Matthew

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.