Re: machine hangs & soft lockups with 10.2.2 / kernel 4.4.0

On 23/01/17 16:40, Tu Holmes wrote:
> While I know this seems a silly question, are your monitoring nodes
> spec'd the same?

Oh, sorry, I should have said that. All 9 machines run osds (1 per
disk); additionally, 3 of the nodes are also mons and 3 (a different 3)
are rgws.

One of the freezing nodes is osds-only, another is osds-and-mons. The
soft-lockup node is osds-and-rgw.
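
To make the role breakdown concrete, here is a minimal sketch in Python that
intersects the roles of the three affected nodes. The node labels are
placeholders (the real hostnames are not given in this thread), and the
helper name common_roles is purely illustrative. Under those assumptions,
the only role all three affected nodes share is osd.

```python
# Roles of the three affected nodes, as described in this message.
# The node labels are placeholders; the real hostnames are not given.
affected = {
    "freeze-1": {"osd"},
    "freeze-2": {"osd", "mon"},
    "softlockup-1": {"osd", "rgw"},
}


def common_roles(nodes):
    """Return the set of roles shared by every node in the mapping."""
    role_sets = list(nodes.values())
    # An empty mapping has no shared roles by definition here.
    return set.intersection(*role_sets) if role_sets else set()


shared = common_roles(affected)
print(sorted(shared))
```

This only shows which roles the affected nodes have in common; it says
nothing about the underlying cause of the freezes or the soft lockup.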

Regards,

Matthew

> //Tu
> On Mon, Jan 23, 2017 at 8:38 AM Matthew Vernon <mv3@xxxxxxxxxxxx
> <mailto:mv3@xxxxxxxxxxxx>> wrote:
> 
>     Hi,
> 
>     We have a 9-node ceph cluster, running 10.2.2 and kernel 4.4.0 (Ubuntu
>     Xenial). We're seeing both machines freezing (nothing in logs on the
>     machine, which is entirely unresponsive to anything except the power
>     button) and suffering soft lockups.
> 
>     Has anyone seen similar? Googling hasn't found anything obvious, and
>     while ceph repairs itself when a machine is lost, this is obviously
>     quite concerning.
> 
>     I don't have any useful logs from the machines that freeze, but I do
>     have logs from the machine that suffered soft lockups - you can see the
>     relevant bits of kern.log here:
> 
>     https://drive.google.com/drive/folders/0B4TV1iNptBAdblJMX1R4ZWI5eGc?usp=sharing
> 
>     [available compressed and uncompressed]
> 
>     The cluster was installed with ceph-ansible, and the specs of each node
>     are roughly:
> 
>     Cores: 16 (2 x 8-core Intel E5-2690)
>     Memory: 512 GB (16 x 32 GB)
>     Storage: 2x 120GB SAMSUNG SSD (system disk)
>              2x 2TB NVME cards (ceph journal)
>              60x 6TB Toshiba 7200 rpm disks (ceph storage)
>     Network: 1 Gbit/s Intel I350 (Control interface)
>              2x 100Gbit/s Mellanox cards (bonded together)
> 
>     We're in pre-production testing, but any suggestions on how we might get
>     to the bottom of this would be appreciated!
> 
>     There's no obvious pattern to these problems, and we've had 2 freezes
>     and 1 soft lockup in the last ~1.5 weeks.
> 
>     Thanks,
> 
>     Matthew
> 
> 
>     --
>      The Wellcome Trust Sanger Institute is operated by Genome Research
>      Limited, a charity registered in England with number 1021457 and a
>      company registered in England with number 2742969, whose registered
>      office is 215 Euston Road, London, NW1 2BE.
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 