On 23/01/17 16:40, Tu Holmes wrote:
> While I know this seems a silly question, are your monitoring nodes
> spec'd the same?

Oh, sorry, I should have said that. All 9 machines have osds on (1 per
disk); additionally 3 of the nodes are also mons and 3 (a different 3)
are rgws.

One of the freezing nodes is osds-only, another is osds-and-mons. The
soft-lockup node is osds-and-rgw

Regards,

Matthew

> //Tu
> On Mon, Jan 23, 2017 at 8:38 AM Matthew Vernon <mv3@xxxxxxxxxxxx
> <mailto:mv3@xxxxxxxxxxxx>> wrote:
>
> Hi,
>
> We have a 9-node ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu
> Xenial). We're seeing both machines freezing (nothing in logs on the
> machine, which is entirely unresponsive to anything except the power
> button) and suffering soft lockups.
>
> Has anyone seen similar? Googling hasn't found anything obvious, and
> while ceph repairs itself when a machine is lost, this is obviously
> quite concerning.
>
> I don't have any useful logs from the machines that freeze, but I do
> have logs from the machine that suffered soft lockups - you can see the
> relevant bits of kern.log here:
>
> https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing
>
> [available compressed and uncompressed]
>
> The cluster was installed with ceph-ansible, and the specs of each node
> are roughly:
>
> Cores:   16 (2 x 8-core Intel E5-2690)
> Memory:  512 GB (16 x32 GB)
> Storage: 2x 120GB SAMSUNG SSD (system disk)
>          2x 2TB NVME cards (ceph journal)
>          60x 6TB Toshiba 7200 rpm disks (ceph storage)
> Network: 1 Gbit/s Intel I350 (Control interface)
>          2x 100Gbit/s Mellanox cards (bonded together)
>
> We're in pre-production testing, but any suggestions on how we might get
> to the bottom of this would be appreciated!
>
> There's no obvious pattern to these problems, and we've had 2 freezes
> and 1 soft lockup in the last ~1.5 weeks.
>
> Thanks,
>
> Matthew
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com