Hammer broke after adding 3rd osd server

Hello everyone,

I've recently performed a hardware upgrade on our small two osd server ceph cluster, which seems to have broken it. We are using ceph to provide rbd images to CloudStack vms. All of our servers run Ubuntu 14.04 LTS with the latest updates and kernel 4.4.6 from the ubuntu repo.

Previous hardware:

2 x osd servers, each with 9 sas osds, 32GB ram, a 12-core Intel 2620 cpu @ 2GHz and 2 consumer SSDs for journals. Infiniband 40Gbit/s networking using IPoIB.

The following things were upgraded:

1. Journal ssds were upgraded from consumer ssds to Intel 3710 200GB. We now have 5 osds per ssd.
2. Added an additional osd server with 64GB ram, 10 osds and an Intel 2670 cpu @ 2.6GHz.
3. Upgraded the ram on the existing osd servers to 64GB.
4. Installed an additional osd disk in each existing server, so all servers now have 10 osds.

After adding the third osd server and finishing the initial sync, the cluster worked okay for 1-2 days and no issues were noticed. On the third day my monitoring system started reporting a bunch of issues from the ceph cluster as well as from our virtual machines. This tends to happen between 7:20am and 7:40am and lasts for about 2-3 hours before things return to normal. I've checked the osd servers and there is nothing I could find in cron or otherwise that starts around 7:20am.
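For completeness, this is roughly what I looked at to rule out scheduled jobs (paths are the standard Ubuntu ones):

# system-wide and per-user cron entries
grep -r . /etc/crontab /etc/cron.d/ 2>/dev/null
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done
# anything cron actually started in the 7:20-7:29 window
grep CRON /var/log/syslog | grep ' 07:2'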

The problem is as follows: the new osd server's load goes to 400+ with ceph-osd processes consuming all cpu resources. ceph -w shows a high number of slow requests relating to osds on the new osd server. The log files show the following:

2016-04-20 07:39:04.346459 osd.7 192.168.168.200:6813/2650 2 : cluster [WRN] slow request 30.032033 seconds old, received at 2016-04-20 07:38:34.314014: osd_op(client.140476549.0:13203438 rbd_data.2c9de71520eedd1.0000000000000621 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2572288~4096] 5.6c3bece2 ack+ondisk+write+known_if_redirected e83912) currently waiting for subops from 22
2016-04-20 07:39:04.346465 osd.7 192.168.168.200:6813/2650 3 : cluster [WRN] slow request 30.031878 seconds old, received at 2016-04-20 07:38:34.314169: osd_op(client.140476549.0:13203439 rbd_data.2c9de71520eedd1.0000000000000621 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1101824~8192] 5.6c3bece2 ack+ondisk+write+known_if_redirected e83912) currently waiting for rw locks
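In case it is useful, something like the following against the admin socket of the affected osd should show where those ops are stuck (osd.22 is taken from the "waiting for subops from 22" line above; run on the host that carries that osd):

ceph daemon osd.22 dump_ops_in_flight
ceph daemon osd.22 dump_historic_ops
ceph health detail | grep -i slow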



Practically every osd is involved in the slow requests, and they tend to be between the two old osd servers and the new one. As far as I can see there are no slow requests between the two old servers alone.

The first thing I checked was the networking. No issue was identified from running ping -i .1 <servername>, nor from using hping3 for tcp connection checks. The network tests ran for over a week and not a single packet was lost, even though the slow requests took place while the tests were running.
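Roughly, the tests were of this form (the interval and port are just examples; 6813 is taken from the osd log line above):

ping -i .1 <servername>
hping3 -S -p 6813 -i u100000 <servername>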

I've also checked the osd and ssd disks and I was not able to identify anything problematic.
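By "checked" I mean things along these lines on each osd server (device name is an example):

smartctl -a /dev/sdb     # SMART attributes on the osd and journal disks
iostat -x 5              # extended per-device stats, looking for very high await or 100% util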

With all osds on the new server stopped, there are no issues between the two old osd servers. I've left the new server disconnected for a few days and had no issues with the cluster.
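For anyone who wants the exact steps, stopping the osds on one server looks roughly like this on Ubuntu 14.04 with upstart (setting noout is optional depending on how long they stay down):

ceph osd set noout       # optional: keep the cluster from rebalancing while they are down
stop ceph-osd-all        # upstart job that stops every osd on this host
ceph osd unset noout     # when the server is brought back in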

I am a bit lost on what else to try and how to debug the issue. Could someone please help me?

Many thanks

Andrei