Sam,
I've done more network testing, this time over two days, and I believe I have enough evidence to conclude that the osd disconnects are not caused by the network. I have run about 140 million TCP connections against each osd and host server over the course of roughly two days, generating about 800-900 connections per second. I have not had a single error or packet drop, and the latency and its standard deviation were minimal.
While the tests were running I did see a number of osds being marked as down by other osds. According to the logs this happened at least three times over the two days. However, this time the cluster IO remained available; the osds simply rejoined with the message that they were wrongly marked down.
I was not able to enable full debug logging across the cluster, as it would have consumed the available disk space in less than 30 minutes, so I am not sure how to debug this particular problem.
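One thing I could try next time, to keep the log volume bounded, is to raise the debug levels that Sam suggested below on just one of the affected osds at runtime and drop them back after a few minutes. A rough sketch (osd.12 is only an example id, and the revert values are approximately the defaults):
ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'
# wait until the osd gets marked down and rejoins, then revert:
ceph tell osd.12 injectargs '--debug-osd 0/5 --debug-ms 0/5 --debug-filestore 1/3'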
What I have done is reboot both osd servers, and so far I have not seen any osd disconnects; the servers have been up for three days already. Perhaps the problem comes down to kernel stability, but if that were the case I would have seen similar issues on Firefly, which I did not. I am not sure what to think now.
Andrei
From: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
To: sjust@xxxxxxxxxx
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, 20 November, 2014 4:50:21 PM
Subject: Re: Giant upgrade - stability issues
Thanks, I will try that.
Andrei
From: "Samuel Just" <sam.just@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, 20 November, 2014 4:26:00 PM
Subject: Re: Giant upgrade - stability issues
You can try to capture logging at
debug osd = 20
debug ms = 20
debug filestore = 20
while an osd is misbehaving.
-Sam
On Thu, Nov 20, 2014 at 7:34 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam,
>
> further to your email I have done the following:
>
> 1. Upgraded both osd servers with the latest updates and restarted each
> server in turn
> 2. fired up the nping utility to generate TCP connections (3-way handshake) from
> each of the servers as well as from the host servers. In total I've run 5
> tests. The nping utility was establishing connections on port 22 (as all
> servers have this port open) with a delay of 1ms. The command used to
> generate the traffic was as follows:
>
> nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 36000000 | gzip
>>/root/nping-hostname-output.gz
>
> The tests took just over 12 hours to complete. The results did not show any
> problems as far as I can see. Here is the tail of the output from one of the
> runs:
>
>
> SENT (37825.7303s) Starting TCP Handshake > arh-ibstorage1-ib:22
> (192.168.168.200:22)
> RECV (37825.7303s) Handshake with arh-ibstorage1-ib:22 (192.168.168.200:22)
> completed
>
> Max rtt: 4.447ms | Min rtt: 0.008ms | Avg rtt: 0.008ms
> TCP connection attempts: 36000000 | Successful connections: 36000000 |
> Failed: 0 (0.00%)
> Tx time: 37825.72833s | Tx bytes/s: 76138.65 | Tx pkts/s: 951.73
> Rx time: 37825.72939s | Rx bytes/s: 38069.33 | Rx pkts/s: 951.73
> Nping done: 1 IP address pinged in 37844.55 seconds
>
>
> As you can see from the above, there are no failed connections at all out of
> the 36 million attempted connections. The average rtt is 0.008ms, and it was
> sending on average almost 1000 packets per second. I've got the same results
> from the other servers.
>
> Unless you have other tests in mind, I think there are no issues with the
> network.
>
> I'll fire up another test, for 24 hours this time, to see if it makes a
> difference.
>
> Thanks
>
> Andrei
>
>
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, 19 November, 2014 9:45:40 PM
>
> Subject: Re: [ceph-users] Giant upgrade - stability issues
>
> Well, the heartbeats are failing due to networking errors preventing
> the heartbeats from arriving. That is causing osds to go down, and
> that is causing pgs to become degraded. You'll have to work out what
> is preventing the tcp connections from being stable.
> -Sam
>
> On Wed, Nov 19, 2014 at 1:39 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
> wrote:
>>
>>>You indicated that osd 12 and 16 were the ones marked down, but it
>>>looks like only 0,1,2,3,7 were marked down in the ceph.log you sent.
>>>The logs for 12 and 16 did indicate that they had been partitioned
>>>from the other nodes. I'd bet that you are having intermittent
>>>network trouble since the heartbeats are intermittently failing.
>>>-Sam
>>
>> AM: I will check the logs further for osds 12 and 16. Perhaps I've missed
>> something, but the ceph osd tree output was showing 12 and 16 as down.
>>
>> Regarding the failing heartbeats, Wido has suggested that I investigate the
>> reason for their failure. The obvious thing to look at is the network, and
>> that is what I did first. However, I do not see any signs of network issues.
>> There are no errors on the physical interface, and ifconfig shows only a very
>> small number of dropped TX packets (1,037 out of roughly 1.8 billion, i.e.
>> about 0.00006%) and 0 errors:
>>
>>
>> # ifconfig ib0
>> ib0 Link encap:UNSPEC HWaddr
>> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>> inet addr:192.168.168.200 Bcast:192.168.168.255
>> Mask:255.255.255.0
>> inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>> RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0
>> TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0
>> collisions:0 txqueuelen:2048
>> RX bytes:6252740293262 (6.2 TB) TX bytes:11343307665152 (11.3
>> TB)
>>
>>
>> How would I investigate what is happening with the heartbeats and the reason
>> for their failures? I have a suspicion that resolving this would also fix the
>> frequent reporting of degraded PGs on the cluster and the intermittent high
>> levels of IO wait on vms.
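>>
>> As a rough starting point (assuming the default log locations), I could grep
>> the osd logs and the cluster log for heartbeat failures around one of the
>> incidents:
>>
>> # messages the osds print when heartbeats time out or when they rejoin
>> grep -i 'heartbeat_check' /var/log/ceph/ceph-osd.*.log
>> # the cluster log's view of the failure reports
>> grep -iE 'wrongly marked me down|failed \(' /var/log/ceph/ceph.log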
>>
>> Also, as I've previously mentioned, the issues started to happen after the
>> upgrade to Giant. I did not have these problems with the Firefly, Emperor or
>> Dumpling releases on the same hardware and with the same cluster loads.
>>
>> Thanks
>>
>> Andrei
>>
>>
>>
>>
>> On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
>> wrote:
>>> Sam,
>>>
>>> Pastebin or similar services will not take tens of megabytes worth of logs.
>>> If we are talking about the debug_ms = 10 setting, I've got about 7 GB worth
>>> of logs generated every half an hour or so. I'm not really sure what to do
>>> with that much data. Anything more constructive?
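>>>
>>> If it comes to it, one option might be to compress and split the logs into
>>> manageable chunks before uploading them somewhere; a sketch (the path and
>>> chunk size are only examples):
>>>
>>> gzip -9 /var/log/ceph/ceph-osd.12.log
>>> split -b 100M /var/log/ceph/ceph-osd.12.log.gz ceph-osd.12.log.gz.part-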
>>>
>>> Thanks
>>> ________________________________
>>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Sent: Tuesday, 18 November, 2014 8:53:47 PM
>>>
>>> Subject: Re: [ceph-users] Giant upgrade - stability issues
>>>
>>> pastebin or something, probably.
>>> -Sam
>>>
>>> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
>>> wrote:
>>>> Sam, the logs are rather large. Where should I post them?
>>>>
>>>> Thanks
>>>> ________________________________
>>>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>>>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>> Sent: Tuesday, 18 November, 2014 7:54:56 PM
>>>> Subject: Re: Giant upgrade - stability issues
>>>>
>>>>
>>>> Ok, why is ceph marking osds down? Post your ceph.log from one of the
>>>> problematic periods.
>>>> -Sam
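>>>>
>>>> To pull out just one problematic window from ceph.log, something along these
>>>> lines might do; a sketch (the timestamps are only placeholders):
>>>>
>>>> # extract roughly 09:00-09:30 on the day of an incident and compress it
>>>> sed -n '/^2014-11-18 09:0/,/^2014-11-18 09:3/p' /var/log/ceph/ceph.log \
>>>>   | gzip > ceph-incident.log.gz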
>>>>
>>>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
>>>> wrote:
>>>>> Hello cephers,
>>>>>
>>>>> I need your help and suggestions on what is going on with my cluster. A few
>>>>> weeks ago I upgraded from Firefly to Giant. I've previously written about
>>>>> having issues with Giant, where over a two-week period the cluster's IO
>>>>> froze three times after ceph marked two osds down. I have just 17 osds in
>>>>> total across two osd servers, plus 3 mons. The cluster is running on Ubuntu
>>>>> 12.04 with the latest updates.
>>>>>
>>>>> I've got zabbix agents monitoring the osd servers and the cluster, and I
>>>>> get alerts for any issues, such as problems with PGs. Since upgrading to
>>>>> Giant, I am frequently receiving emails warning that the cluster has
>>>>> degraded PGs, around 10-15 such emails per day. The number of degraded PGs
>>>>> varies from a couple to over a thousand. After several minutes the cluster
>>>>> repairs itself. The total number of PGs in the cluster is 4412 across all
>>>>> the pools.
>>>>>
>>>>> I am also seeing more alerts from vms reporting high IO wait and hung
>>>>> tasks; some vms report over 50% IO wait.
>>>>>
>>>>> This has not happened on Firefly or the previous releases of ceph. Not much
>>>>> has changed in the cluster since the upgrade to Giant: the networking and
>>>>> hardware are still the same, and it is still running the same version of
>>>>> Ubuntu. The cluster load hasn't changed either. Thus, I think the issues
>>>>> above are related to the upgrade of ceph to Giant.
>>>>>
>>>>> Here is the ceph.conf that I use:
>>>>>
>>>>> [global]
>>>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>>>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib,
>>>>> arh-cloud13-ib
>>>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>>>>> auth_supported = cephx
>>>>> osd_journal_size = 10240
>>>>> filestore_xattr_use_omap = true
>>>>> public_network = 192.168.168.0/24
>>>>> rbd_default_format = 2
>>>>> osd_recovery_max_chunk = 8388608
>>>>> osd_recovery_op_priority = 1
>>>>> osd_max_backfills = 1
>>>>> osd_recovery_max_active = 1
>>>>> osd_recovery_threads = 1
>>>>> filestore_max_sync_interval = 15
>>>>> filestore_op_threads = 8
>>>>> filestore_merge_threshold = 40
>>>>> filestore_split_multiple = 8
>>>>> osd_disk_threads = 8
>>>>> osd_op_threads = 8
>>>>> osd_pool_default_pg_num = 1024
>>>>> osd_pool_default_pgp_num = 1024
>>>>> osd_crush_update_on_start = false
>>>>>
>>>>> [client]
>>>>> rbd_cache = true
>>>>> admin_socket = /var/run/ceph/$name.$pid.asok
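>>>>>
>>>>> Nothing heartbeat-related is set explicitly here, so it may also be worth
>>>>> confirming what the running osds actually use for those values via the
>>>>> admin socket; a sketch (osd.12 is only an example, run on the host that
>>>>> carries it and assuming the default osd socket path):
>>>>>
>>>>> # dump the heartbeat-related settings the daemon is running with
>>>>> ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok config show | grep -i heartbeat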
>>>>>
>>>>>
>>>>> I would like to get to the bottom of these issues. I am not sure whether
>>>>> they could be fixed by changing some settings in ceph.conf, or whether a
>>>>> full downgrade back to Firefly is needed. Is a downgrade even possible on a
>>>>> production cluster?
>>>>>
>>>>> Thanks for your help
>>>>>
>>>>> Andrei
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com